Note: An html version of this document, which includes the evaluated result of all code chunks, can be found at https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html.

startTime <- proc.time()
startSysTime <- Sys.time()
setupEndTime <- proc.time()

Preface: crc-version branch

This document was created to test the performance of the Center for Research Computing high-performance computing cluster, and to save the data for later. Because Pitt’s HPC cluster setup differs from the University of Canterbury HPC cluster on which this document was originally run, the output data in this document isn’t identical to that of the original. Information on cluster performance (i.e., timing) can be found below. This repo also contains the Slurm shell script that was used to run this document.

Introduction

This guide walks readers through the process of training a random-forest classifier for automated coding of sociophonetic variation, using the statistical computing language R (R Core Team, 2018) and the packages ranger (Wright & Ziegler, 2017) and caret (Kuhn, 2018). This guide is intended to be a companion to our manuscript “From categories to gradience: Auto-coding sociophonetic variation with random forests”, and it recreates the classifier of Southland English non-prevocalic /r/ discussed in the manuscript. By consulting the R Markdown file that generated this document, readers can run the code on their own systems to recreate the /r/ classifier.

IMPORTANT: While you can run all this code, we don’t recommend trying to run it all at once (e.g., by ‘knitting’ the R Markdown document); the code that generated this document was tailored to a machine with 32 processors and still takes several hours to run start-to-finish. As we mention below, training a classifier is not a one-size-fits-all process.

This guide goes through the following steps:

  1. Pre-process data

  2. Determine classifier performance measure

  3. Train an initial classifier

  4. Tune hyperparameters

  5. Determine effect of outliers

  6. Train final classifier

  7. Auto-code new tokens and extract classifier probabilities for hand-coded tokens

Perhaps the most important thing to know is that training a classifier is not a one-size-fits-all process. Your decisions about how to go about training a classifier will change depending on your data, your variables, your research questions, your standards for what counts as adequate performance, etc. This guide describes the process for one particular set of data, a particular variable, particular research questions, etc., and shouldn’t be taken as an exact template for every possible case.

Implementation notes

This guide is intended to be accessible to intermediate to advanced R users. Some of the coding techniques used in this guide are fairly advanced, but you won’t need to understand all the code to use this guide or to train your own classifiers.

This guide also includes text that’s visible only on the R Markdown version of the document (to avoid cluttering the html version) giving more detail about the implementation of the code. Like this paragraph, R Markdown-only text in this document begins with a ‘greater than’ sign.

The steps in this guide can be used by any computer that can run R. That said, progressing through all of the steps listed here can be very time-consuming, especially with large amounts of data. As a result, all of the forests in this guide were generated on a computing cluster running Ubuntu 16.04 with 32 virtual cores and 64 GB of memory, in order to leverage R’s support for parallel computing. (Despite this powerful hardware, the code in this guide takes over 4 hours to run in terms of pure running time.) The code used here for running models in parallel (which uses the package doParallel) should work on all systems, although the user will need to adjust the number of cores to their system (you can find out how many cores are available via the detectCores() function in the parallel package). Note that train() (the function we use to train classifiers in this guide) by default uses a parallel backend when one is available (e.g., to train multiple resamples at a time), so parallelization is used even in instances where the end-user doesn’t explicitly invoke it using doParallel methods.

Since random forests are a stochastic process, results can be slightly different with each run, so this guide uses random seeds to ensure reproducibility. In single forest runs, it suffices to use set.seed() prior to running train(). In situations where doParallel methods are used to run a list of forests differing by a hyperparameter, it’s necessary to use clusterSetRNGStream() after registering the cluster and before calling foreach(); this ensures that each core starts with the same random seed. Finally, in Step 5 when we want to run lists of non-identical forests that don’t differ by any parameters, we use the seeds argument of trainControl().

Finally, our code creates ‘dummy files’, with no extension or contents, in a Model Status subfolder to help us keep track of when the sub-models finish running. This technique is especially useful when you expect some sub-models to take longer than others (e.g., repeated k-fold cross-validation will naturally take longer than non-repeated k-fold cross-validation). Although we’re only saving ‘dummy files’ here, you can also save more useful files; for example, if the overall model is too big to store in memory and the model crashes before all sub-models are complete, you can replace the second ‘dummy file’ with a command to save each sub-model when it’s finished running or a summary of the sub-model.

if (!dir.exists("Model Status")) dir.create("Model Status")

Step 0: Get hand-coded data and load required packages

To run the code in this guide, you’ll need R and several R packages. In particular, this code relies heavily on the collection of packages called the tidyverse, which make R code faster, more readable, and safer. Though you can run this code without it, we recommend the RStudio integrated development environment for ease of understanding code. Information on the system, R version, and package versions that generated this guide can be found at the end of this guide.

As a prerequisite to the steps we describe in this guide, you’ll also need data that has been hand-coded into two or more variants and for which you have all the acoustic measures you’ll want to enter into the classifier. In the case of our /r/ classifier (for which we provide the data as an example), we had several thousand hand-coded tokens, for which we extracted 180 acoustic measures using the “Process with Praat” function in LaBB-CAT (Fromont & Hay, 2012); these measures were motivated by previous literature on the acoustic correlates of rhoticity in English and other languages (e.g., Lawson, Stuart-Smith, & Scobbie, 2018; Heselwood, 2009). More details on the choice of these measures can be found in the main text.

Load the /r/ data, which includes 40614 rows and 217 columns, the last 180 of which are acoustic measures to be fed into the classifier.

The first 37 columns contain mostly linguistic information, plus LaBB-CAT metadata (TokenNum and MatchId) and speaker data (Speaker, which has been anonymized, and Gender). To preserve anonymity, both Speaker and Word have been anonymized to dummy factors. Stress is coded as either ’ (primary word stress), " (secondary word stress), or 0 (unstressed). Syllable and columns ending in “DISC” are phonemic strings represented in the DISC phonetic alphabet used by CELEX (see https://catalog.ldc.upenn.edu/docs/LDC96L14/eug_let.pdf); FollSegmentNoPause and FollSegment are represented in the ARPAbet codes used by the CMU pronouncing dictionary, without stress markers on vowels (see http://www.speech.cs.cmu.edu/cgi-bin/cmudict); and Vowel is represented in Wells lexical sets as they relate to New Zealand English (see https://www.phon.ucl.ac.uk/home/wells/stanlexsets.htm). HowCoded is used to separate tokens that have already been hand-coded from those to be auto-coded (which therefore have NA for Rpresent). All other columns should be self-explanatory.

trainingData <- readRDS("Data/RClassifierData_03July2019.Rds")
trainingData %>% dim()
[1] 40614   217
trainingData %>% colnames()
  [1] "TokenNum"           "Speaker"            "Gender"            
  [4] "MatchId"            "Stress"             "FollPause"         
  [7] "CelexFreq"          "FollPauseDur"       "Syllable"          
 [10] "SyllStart"          "SyllEnd"            "Word"              
 [13] "WordStart"          "WordEnd"            "TokenDISC"         
 [16] "TokenStart"         "TokenEnd"           "TokenGenAmDISC"    
 [19] "TokenGenAmStart"    "TokenGenAmEnd"      "Rpresent"          
 [22] "FollSegmentDISC"    "RType"              "PrevRInWord"       
 [25] "ContFuncWord"       "CorpusFreq"         "Vowel"             
 [28] "NURSEvsNonNURSE"    "FollSegmentNoPause" "FollSegment"       
 [31] "SyllFinal"          "WordFinal"          "HowCoded"          
 [34] "Initial_F1_50"      "Initial_F2_50"      "Initial_F3_50"     
 [37] "Initial_F4_50"      "TokenDur"           "F1_20"             
 [40] "diffF3F1_20"        "BW1_20"             "F1_25"             
 [43] "diffF3F1_25"        "BW1_25"             "F1_30"             
 [46] "diffF3F1_30"        "BW1_30"             "F1_35"             
 [49] "diffF3F1_35"        "BW1_35"             "F1_40"             
 [52] "diffF3F1_40"        "BW1_40"             "F1_45"             
 [55] "diffF3F1_45"        "BW1_45"             "F1_50"             
 [58] "diffF3F1_50"        "BW1_50"             "F1_55"             
 [61] "diffF3F1_55"        "BW1_55"             "F1_60"             
 [64] "diffF3F1_60"        "BW1_60"             "F1_65"             
 [67] "diffF3F1_65"        "BW1_65"             "F1_70"             
 [70] "diffF3F1_70"        "BW1_70"             "F1_75"             
 [73] "diffF3F1_75"        "BW1_75"             "F1_80"             
 [76] "diffF3F1_80"        "BW1_80"             "F1min"             
 [79] "F1max"              "time_F1min"         "time_F1max"        
 [82] "F1range"            "time_F1range"       "slopeF1"           
 [85] "F2_20"              "diffF3F2_20"        "BW2_20"            
 [88] "F2_25"              "diffF3F2_25"        "BW2_25"            
 [91] "F2_30"              "diffF3F2_30"        "BW2_30"            
 [94] "F2_35"              "diffF3F2_35"        "BW2_35"            
 [97] "F2_40"              "diffF3F2_40"        "BW2_40"            
[100] "F2_45"              "diffF3F2_45"        "BW2_45"            
[103] "F2_50"              "diffF3F2_50"        "BW2_50"            
[106] "F2_55"              "diffF3F2_55"        "BW2_55"            
[109] "F2_60"              "diffF3F2_60"        "BW2_60"            
[112] "F2_65"              "diffF3F2_65"        "BW2_65"            
[115] "F2_70"              "diffF3F2_70"        "BW2_70"            
[118] "F2_75"              "diffF3F2_75"        "BW2_75"            
[121] "F2_80"              "diffF3F2_80"        "BW2_80"            
[124] "F2min"              "F2max"              "time_F2min"        
[127] "time_F2max"         "F2range"            "time_F2range"      
[130] "slopeF2"            "F3_20"              "BW3_20"            
[133] "F3_25"              "BW3_25"             "F3_30"             
[136] "BW3_30"             "F3_35"              "BW3_35"            
[139] "F3_40"              "BW3_40"             "F3_45"             
[142] "BW3_45"             "F3_50"              "BW3_50"            
[145] "F3_55"              "BW3_55"             "F3_60"             
[148] "BW3_60"             "F3_65"              "BW3_65"            
[151] "F3_70"              "BW3_70"             "F3_75"             
[154] "BW3_75"             "F3_80"              "BW3_80"            
[157] "F3min"              "F3max"              "time_F3min"        
[160] "time_F3max"         "F3range"            "time_F3range"      
[163] "slopeF3"            "intens_F3min"       "intens_F3max"      
[166] "F4_20"              "diffF4F3_20"        "BW4_20"            
[169] "F4_25"              "diffF4F3_25"        "BW4_25"            
[172] "F4_30"              "diffF4F3_30"        "BW4_30"            
[175] "F4_35"              "diffF4F3_35"        "BW4_35"            
[178] "F4_40"              "diffF4F3_40"        "BW4_40"            
[181] "F4_45"              "diffF4F3_45"        "BW4_45"            
[184] "F4_50"              "diffF4F3_50"        "BW4_50"            
[187] "F4_55"              "diffF4F3_55"        "BW4_55"            
[190] "F4_60"              "diffF4F3_60"        "BW4_60"            
[193] "F4_65"              "diffF4F3_65"        "BW4_65"            
[196] "F4_70"              "diffF4F3_70"        "BW4_70"            
[199] "F4_75"              "diffF4F3_75"        "BW4_75"            
[202] "F4_80"              "diffF4F3_80"        "BW4_80"            
[205] "F4min"              "F4max"              "time_F4min"        
[208] "time_F4max"         "F4range"            "time_F4range"      
[211] "slopeF4"            "F0min"              "F0max"             
[214] "F0rangeST"          "time_F0min"         "time_F0max"        
[217] "absSlopeF0"        

Step 1: Pre-process data: Normalization, missing value imputation, outlier marking

Due to a fair amount of measurement error in automatically extracting formant and pitch measurements, and to account for the degree to which formant frequencies are impacted by vocal tract length, we subjected /r/ measurements to pre-processing before entering data into the classifier. Your data might not need all these steps—we didn’t need to do nearly as much pre-processing for the medial /t/ data—but we’ll walk through them for your reference.

First, we removed tokens that either had bad durations or immediately preceded another /r/. Checking durations is a way to filter out tokens whose forced alignment is bad, since individual /r/ tokens are unlikely to be long (except perhaps in word lists or utterance-finally). We randomly sampled tokens in a variety of duration bands (e.g., [500 ms, 550 ms)) to check for misalignment; this procedure led to a ceiling of 450 ms. A handful of tokens somehow had zero or negative durations, so we removed these tokens, as well.

trainingData <-
  trainingData %>% 
  filter(TokenDur > 0, TokenDur < 0.45, FollSegment!="R")

Second, we normalized F1–F4 measurements (timepoint measurements and maxima/minima) by subtracting the speaker’s mean word-initial /r/ midpoint measurement for that formant from the raw measurement. As with vowels, formant measurements for rhotics are affected by vocal tract length, but whereas normalization methods abound for vowel formant frequencies, there is no generally accepted way to normalize formants in rhotics. Our method of using each speaker’s initial /r/ formants as a baseline measure accounts for vocal tract length, following Hay and Clendon’s (2012) approach of modeling rhoticity by comparing non-prevocalic /r/ F3 against initial /r/ F3.

trainingData <-
  trainingData %>% 
  mutate_at(vars(matches("^F1(_[2-8][50]|min|max)$")), ~ . - Initial_F1_50) %>% 
  mutate_at(vars(matches("^F2(_[2-8][50]|min|max)$")), ~ . - Initial_F2_50) %>% 
  mutate_at(vars(matches("^F3(_[2-8][50]|min|max)$")), ~ . - Initial_F3_50) %>% 
  mutate_at(vars(matches("^F4(_[2-8][50]|min|max)$")), ~ . - Initial_F4_50)

Third, we centered and scaled token duration by speaker and vowel; in cases where token duration was invariant within a speaker \(\times\) vowel distribution, we used a value of 0.

trainingData <-
  trainingData %>% 
  group_by(Speaker, Vowel) %>% 
  mutate(TokenDurSD = sd(TokenDur, na.rm=TRUE) %>% replace_na(0),
         TokenDur = if_else(TokenDurSD > 0, scale(TokenDur), 0)) %>% 
  select(-TokenDurSD) %>% 
  ungroup()

Fourth, to deal with tokens that had missing pitch and/or F4 measurements (including F4 bandwidth), we imputed missing F0 and F4 measurements using k Nearest Neighbor imputation (with k = 10) via the function knnImputation() in the package DMwR (Torgo, 2010). Since random forests cannot classify observations with missing measurements, the imputation of missing measurements is common practice in machine learning contexts (e.g., Breen, Thomas, Baldwin, & Lipinska, 2019; Hudak, Crookston, Evans, Hall, & Falkowski, 2008). At the same time, this step won’t necessarily apply to every data set (we didn’t need to impute any measurements for the medial /t/ data). Note that this next step can be fairly time-consuming (on the cluster, it takes around 15 minutes).

For time-consuming steps such as imputation, we use caching to save time by avoiding re-running code. If we give the chunk a unique name and set cache=params$testing, R Markdown will save the results of the code chunk to a cache when initially knitting the file, such that on subsequent knits the results will simply be re-loaded. This can save a considerable amount of time, especially if you want to tweak text that’s outside of the cached chunks or plots that use data generated by cached chunks, etc. More info on caching can be found at https://yihui.name/knitr/demo/cache/ and in the knitr manual at https://yihui.name/knitr/demo/manual/.

##Capture current column ordering
origCols <- trainingData %>% colnames()

##Set up dataframe for imputation: Must contain only the measures to be imputed 
##  and the measures being used to calculate k nearest neighbors.
imputeDF <-
  trainingData %>%
  ##In this data, MatchId serves as a unique identifier for tokens
  select(MatchId, TokenDur:absSlopeF0) %>% 
  select(-contains("F0"), -contains("F4"), -contains("BW4")) %>%
  select_if(~ sum(is.na(.)) < 300) %>% 
  filter(complete.cases(.)) %>% 
  left_join(trainingData %>% 
              select(MatchId, TokenDur:absSlopeF0) %>% 
              select(MatchId, contains("F0"), contains("F4"), contains("BW4")),
            by="MatchId")

##Perform imputation--this takes a while!
imputeDF <- 
  imputeDF %>% 
  select(-MatchId) %>% 
  as.data.frame() %>% 
  knnImputation() %>% 
  cbind(imputeDF %>% select(MatchId))

##Add imputed values back into data frame (MatchId is a unique identifier, so
##  we can use it to match tokens between data frames)
trainingData <-
  trainingData %>% 
  left_join(imputeDF %>% 
              select(MatchId, contains("F0"), contains("F4"), contains("BW4")), 
            by="MatchId") %>% 
  select(-ends_with(".x")) %>% 
  rename_at(vars(ends_with(".y")), str_remove, ".y") %>% 
  select(!!origCols) 

##Clean up
rm(imputeDF)

Fifth, we marked numerous measures for potential outlier-hood: formant timepoint measures, formant minima/maxima, pitch timepoint measures, pitch minima/maxima, intensity measures, and duration. Potential outliers were measurements that were more than 1.5 \(\times\) the interquartile range (third quartile minus first quartile) outside the first or third quartile for that measure’s distribution by speaker and by vowel. We won’t actually exclude outliers now; instead, we’ll use outlier data in Step 5 below to figure out which outliers we can afford to keep, and which ones we’re better off discarding. As with imputation, outlier-marking isn’t necessary or appropriate for all situations; we give more details in the subsection following this one.

trainingData <-
  trainingData %>%
  group_by(Speaker, Vowel) %>% 
  mutate_at(vars(matches("^F[01234](_[2-8][50]|min|max)$"), starts_with("intens_"), TokenDur), 
            list(Outlier = ~ . < quantile(., 0.25, na.rm=TRUE) - 1.5*IQR(., na.rm=TRUE) |
                   . > 1.5*IQR(., na.rm=TRUE) + quantile(., 0.75, na.rm=TRUE))) %>% 
  ungroup()

Finally, we excluded tokens that had missing measurements for measures other than F0 and F4.

trainingData <-
  trainingData %>% 
  filter_at(vars(TokenDur:absSlopeF0), all_vars(!is.na(.))) %>% 
  droplevels()

Let’s save our progress by writing the training data to an RDS file.

saveLoadStartTime <- proc.time()

if (!dir.exists("Data")) dir.create("Data")
trainingData %>% saveRDS("Data/trainingData.Rds")
trainingData <- readRDS("Data/trainingData.Rds")

saveLoadEndTime <- proc.time()

Do we even need to mark outliers?

Measurement error is a fact of life when using corpus methods, and of course classifiers will perform better if they have cleaner data. At the same time, not all data is appropriate for outlier-marking. Here are the conditions under which outlier-marking is appropriate:

  1. Some measures include measurement error. If your data is spotless, then any outliers are ‘true’ outliers, not just mis-measurements, and your model will be better off for being able to see them.

  2. For the measures with some measurement error, the distributions are either unimodal or their bi-/multimodality isn’t due to the classes of interest.

Let’s briefly investigate these conditions. First, do we have evidence of measurement error in our data? Looking at F3 minimum, we appear to have a good distribution, but the extent of it seems questionable (do we really have tokens whose F3 minimum is below 1000 Hz?). Similarly, the intensity at the time of the F3 minimum has some values that seem conspicuously low. When we hand-check these tokens, we find that these measurements are indeed erroneous, meaning there’s evidence for measurement error in our data:

trainingData %>% 
  filter(!is.na(F3min)) %>% 
  mutate(F3min_orig = F3min + Initial_F3_50) %>% 
  ggplot(aes(x=F3min_orig)) +
  geom_histogram(binwidth=50) +
  xlab("F3 minimum (Hz)")

trainingData %>% 
  filter(!is.na(intens_F3min)) %>% 
  ggplot(aes(x=intens_F3min)) +
  geom_histogram(binwidth=1) +
  xlab("Intensity at time of F3 minimum (dB)")

Once we’ve detected measurement error, we have to determine whether these measures are unimodally distributed. Rather than just eyeball the distributions, we’ll conduct Hartigan’s dip tests, which determine whether there is significant evidence for non-unimodality in a distribution (Freeman & Dale, 2013; Hartigan & Hartigan, 1985), using the diptest package (Maechler, 2016). The null hypothesis in these tests is that the distribution is unimodal, so a significant result means that we have evidence that the distribution is bimodal or multimodal. The dip test on the F3 minimum distribution indicates no evidence for non-unimodality, but (perhaps surprisingly) the dip test on the intensity at F3 minimum distribution indicates significant evidence for non-unimodality. In other words, F3 minimum has met the conditions for outlier-marking, but intensity hasn’t.

trainingData$F3min %>% dip.test()

    Hartigans' dip test for unimodality / multimodality

data:  .
D = 0.0010115, p-value = 0.9962
alternative hypothesis: non-unimodal, i.e., at least bimodal
trainingData$intens_F3min %>% dip.test()

    Hartigans' dip test for unimodality / multimodality

data:  .
D = 0.021, p-value < 2.2e-16
alternative hypothesis: non-unimodal, i.e., at least bimodal

The reason for the bimodality condition is that we don’t want to outlier-mark a bimodal distribution that is merely the combinaton of two unimodal distributions that each correspond to one class. If underlying the distribution of some measure are two fairly separate unimodal distributions by class, then outlier-marking is going to eliminate some measurements that are extreme with respect to the pooled distribution but are actually well-behaving measurements for their own class. As a result, if our bimodal distribution isn’t the result of the classes of interest, but instead the result of some other factor external to the classes, then it’s okay to outlier-mark the distribution (although it might be beneficial to normalize for the external factor if that’s possible). To that end, we can look at the by-class distribution of intensity; since we don’t have evidence for different modes for Absent vs. Present, intensity meets the conditions for outlier-marking.

trainingData %>% 
  filter(HowCoded=="Hand") %>% 
  ggplot(aes(x=intens_F3min, fill=Rpresent)) +
  geom_density(alpha=0.5) +
  xlab("Intensity at time of F3 minimum (dB)")

We don’t have any examples of distributions whose bimodality is due to the Absent/Present in the /r/ data, but in the medial /t/ data, the center of gravity at 70% duration is clearly bimodal, and this is clearly due to Voiced tokens having a lower center of gravity than Voiceless tokens.

Plot: Medial /t/ center of gravity at 70% by class

Plot: Medial /t/ center of gravity at 70% by class

Now that we’ve determined that we can outlier-mark these measures, what criteria do we use to mark individual measurements as outliers? Two potential rules of thumb are to mark measurements that are more than 2 standard deviations from the mean, or to mark measurements that are more than 1.5 times the interquartile range from the first or third quartiles. The standard deviation-based rule is only appropriate if the data is normal, while the IQR-based rule is more general. In this case, since we had a lot of measures to outlier-mark, some of which are clearly non-normal (e.g., intensity at F3 minimum), we used the more general IQR rule of thumb. We also generated distributions within each speaker and vowel, since a well-behaving measurement in a rare vowel context can easily look like an outlier in the overall data.

Step 2: Determine classifier performance metric to optimize

Before tuning hyperparameters, it’s necessary to choose a classifier performance metric that the process of hyperparameter tuning will optimize. If you simply want the most accurate classifier possible, optimize for overall accuracy; if it’s overwhelmingly important that a particular class be accurately represented, optimize for that class’s accuracy. In our case, we wanted our classifiers to be highly accurate, but we also wanted our classifiers to balance true positives and true negatives; the latter aim was especially important in light of the class imbalance between Present and Absent in the /r/ training data. As a result, we tuned hyperparameters to optimize the area under the ROC curve (AUC) \(\times\) overall accuracy (see the Methods section of the main text). We also wanted to monitor the class accuracies of Present and Absent during the tuning process, even though we didn’t choose hyperparameters to optimize class accuracies.

The package that we used to run the classifiers, caret (Kuhn, 2018), allows the user to specify custom summary functions to report summary statistics, which is useful for hyperparameter tuning. We used the following function (a lightly modified version of caret’s multiClassSummary()) to report several summary statistics, including overall accuracy \(\times\) AUC.

cutoffSummary <- function(data, lev=NULL, model=NULL, obsCol="obs", returnDF=FALSE) {
  require(ROCR)
  require(tidyverse)
  require(magrittr)
  
  if (!is.null(obsCol) & obsCol!="obs") data <- data %>% rename("obs" = obsCol)
  
  preds <- data %$% prediction(Present, obs)
  auc <-
    preds %>%
    performance(measure="auc") %>%
    attr("y.values") %>%
    extract2(1)
  
  cutoff <-
    preds %>%
    performance(measure="acc") %>%
    attributes() %>%
    extract(c("x.values", "y.values")) %>%
    map_dfc(extract2, 1) %>%
    filter(y.values==max(y.values)) %>%
    pull(x.values) %>%
    median()
  
  data <- data %>%
    mutate(PredCutoff = if_else(Present >= cutoff, "Present", "Absent"),
           PredHalf = if_else(Present >= 0.5, "Present", "Absent"),
           AccCutoff = PredCutoff==obs,
           AccHalf = PredHalf==obs)
  
  class_means <-
    data %>%
    group_by(obs) %>%
    summarise_at(vars(starts_with("Acc")), mean) %>%
    gather(Type, Acc, starts_with("Acc")) %>%
    mutate(Type = str_c(Type, "_", obs)) %>%
    select(Type, Acc) %>%
    spread(Type, Acc) %>%
    rename_all(str_replace, "Acc", "ClassAccuracy") %>%
    rename_all(str_remove, "Cutoff")
  
  # confMat <- data %$% table(Predicted = PredCutoff, Actual = obs)
  
  ret <- data.frame(AccAUC = mean(data$AccCutoff)*auc,
                    AUC = auc,
                    Accuracy = mean(data$AccCutoff),
                    BestCutoff = cutoff,
                    AccuracyHalf = mean(data$AccHalf)) %>%
    cbind(class_means)
  
  if (!returnDF) ret <- ret %>% map_dbl(I)
  return(ret)
}

Step 3: Train an initial classifier

First, we train an initial classifier using the default settings of train(). This initial classifier will help us understand what sort of performance we can expect from our final model; although we’ll tune hyperparameters and exclude outliers to improve the model’s performance, in reality the final model’s performance won’t be drastically better than the initial model.

In terms of syntax, we feed train() a formula, the data, a model type, a variable importance mode, and a trainControl object that contains some more details about how the model is to be run:

  • The formula describes the dependent variable we’re trying to classify (Rpresent) and says that everything else in the data (.) will serve as independent variables.

  • For the data, we use just the hand-coded tokens from the training data (recall that we’ve already filtered out tokens with missing measurements, etc., in Step 1), and just the columns containing the dependent variable and the acoustic measures that we want to use in training our classifier (hence why the formula has . on the right-hand side).

  • We specify the "ranger" method for running random forests because train() can use over 200 models for classification and regression.

  • The variable importance mode will determine how ranger() ranks variables in terms of importance (here, "impurity" is the Gini index).

  • Finally, the various arguments of trainControl() tell train() to save the resampled classification predictions that the model generates for the hand-coded data, use our custom summary function (so we can use a non-default cutoff and calculate our chosen performance measure, overall accuracy \(\times\) AUC), save info on which tokens were in which resamples, and calculate not only binary classifications but also classifier probabilities (which is necessary for calculating AUC).

Because we haven’t specified anything else (e.g., number of trees, resampling method), train() will use its defaults for the various hyperparameters we’ll tune later.

fstInitStartTime <- proc.time()
##Run the model using a single core
registerDoSEQ()
##RF are a stochastic process, so ensure reproducibility by setting a seed
set.seed(302302)

fstInit <-
  train(Rpresent ~ .,
        data=trainingData %>% 
          filter(HowCoded=="Hand") %>% 
          select(Rpresent, TokenDur:absSlopeF0),
        method="ranger",
        importance="impurity",
        trControl=trainControl(savePredictions=TRUE,
                               summaryFunction=cutoffSummary,
                               returnResamp="all",
                               classProbs=TRUE))
Loading required package: ROCR
fstInitEndTime <- proc.time()

The fstInit model we’ve just generated has lots of information inside of it. To pull out the most salient information from the model—the resampled performance results, an average confusion matrix, the variable importance scores, and timing information—we’ll use a summary function. This summary function will also come in handy because the forest models are large files (fstInit on its own is 80.6 Mb), and they’ll add up when we’re generating large lists of them in Steps 4 and 5, so we’ll just save the summaries to disk rather than the forests.

To the last point about only saving the summaries: if we’re caching the chunks that are used to run the forests, the forests will be saved to disk as well; cache with caution. This file caches the forest-running chunks on a computing cluster with lots of extra space, but if you don’t have lots of extra space to use up on your drive, we wouldn’t recommend caching forests (especially if you set random seeds to ensure that you’d get the same results every time); the forests run in this document take up about 4 Gb.

fstSummary <- function(x) {
  require(tidyverse)
  
  ##Create a simple summary with resampled performance measures, confusion
  ##  matrix, variable importance, and timing info
  simpleSummary <- function(x) {
    resample <- 
      x %>% 
      extract2("resample") %>% 
      select(AccAUC, Accuracy, AUC, everything())
    cutoff <- 
      resample %>% 
      pull(BestCutoff) %>% 
      mean(na.rm=TRUE)
    confMat <- 
      x %>% 
      extract2("pred") %>% 
      mutate(pred = if_else(Present >= cutoff, "Present", "Absent")) %$% 
      table(Predicted = pred, Actual = obs)
    varImp <- 
      x %>% 
      extract2("finalModel") %>% 
      extract2("variable.importance")
    times <- x %>% 
      extract2("times") %>% 
      extract(-3)
    
    list(resample = resample, confMat = confMat, 
         varImp = varImp, times = times)
  }
  
  ##If x is a single forest, return a simple summary; if x is a list of
  ##  forests, make each summary element a list of results for each forest
  ##  (except for resample, which combines the results for all forests and adds
  ##  a Model column)
  if ("train" %in% class(x)) {
    smry <- simpleSummary(x)
  } else if (x %>% map_lgl(~ "train" %in% class(.x)) %>% all()) {
    if (x %>% names() %>% is.null()) names(x) <- seq_along(x)
    smry <- 
      x %>% 
      map(simpleSummary) %>% 
      transpose()
    smry$resample <- 
      smry$resample %>% 
      imap_dfr(~cbind(Model = .y, .x, stringsAsFactors=FALSE)) %>% 
      mutate_at(vars(Model), factor, levels=names(x))
  }
  
  smry
}

For example, we can calculate the mean of resampled performance measures from the resample element of our summary object. Here, we see that the mean overall accuracy \(\times\) AUC of this initial classifier was 0.6972862:

smryInit <- fstInit %>% fstSummary()
smryInit$resample %>% 
  summarise_if(is.numeric, mean, na.rm=TRUE)

Or we can look at timing, in the times element. Here, we see that the classifier took 10 minutes to run, but if we’d run it on a single core rather than with a parallel backend, it would’ve taken 203 minutes to run:

smryInit$times
$everything
    user   system  elapsed 
8970.570   13.794  894.277 

$final
   user  system elapsed 
 43.540   0.023   4.218 

Of course, this is just one way to summarize a forest; depending on your needs, you might want to pull out different information. Regardless, we’ll save this summary (and not the forest itself) to the Model Status directory, and we’ll remove the forest from our workspace to save on memory:

smryInit %>% saveRDS("Model Status/SmryInit.Rds")
rm(fstInit)

Step 4: Tune hyperparameters

In this step, we adjust several hyperparameters to find the settings that will yield the best-performing model (with “best-performing” defined with respect to our chosen performance measure, of course). These hyperparameters control how the random forest classifier is run in ranger and how caret determines classifier performance. In particular, caret takes the training data and generates training and test subsets to calculate classifier performance measures, optionally performing additional resampling on these training and test subsets.

We tune the following hyperparameters:

  • Additional resampling to perform on the training and test subsets (trainControl(sampling))

  • The method for generating training and test subsets from the training set, for the purpose of measuring performance (the method argument of trainControl())

  • Parameters controlling the generation of training and test subsets, depending on the method: number of folds/resampling iterations (trainControl(number)), number of repeated k-fold cross-validation iterations (trainControl(repeats)), and/or leave-p-out cross-validation training percentage (trainControl(p))

  • Parameters of the ranger model (controlled via the tuneGrid argument of train()):

    • The number of variables to test at each node (mtry)

    • Node splitting rule (splitrule)

    • Minimum node size (min.node.size)

  • The number of trees to generate (train(num.trees))

Because we have unbalanced classes in our hand-coded data, this additional resampling is likely to have the greatest impact on classifier performance, hence why we tune this hyperparameter first.

Additional resampling

After constructing training and test subsets, train() can perform additional resampling to adjust for class imbalances in the training and test subsets prior to assessing classifier performance. By default, train() performs no additional resampling, but other options are downsampling the majority class to match the size of the minority class, and SMOTE sampling, a hybrid method that combines downsampling the majority class with creating synthetic members of the minority class (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Given that the data are imbalanced, it’s likely that a model that looks at the data as-is will perform well on the majority class, Absent, but poorly on the minority class, Present. For all the other hyperparameters, we start with default values (method="boot", number=25, num.trees=500, tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1)).

resampStartTime <- proc.time()
##Define a vector with possible values of our hyperparameter
tuneResamples <- c("none", "down", "smote")

##Register a parallel cluster with 3 cores (the length of our hyperparameter
##  vector)
cl <- makeCluster(tuneResamples %>% length())
registerDoParallel(cl)
##Set a parallel seed for reproducibility
clusterSetRNGStream(cl, 302302) 

##Train the list of forests; the code inside the curly braces will be run in
##  parallel for each value of the vector tuneResamples, outputting a classifier
##  for each setting
fstListResamples <-
  foreach(tuneRsmp = tuneResamples, 
          .packages=c("caret","dplyr")) %dopar% {
            ##Create an extensionless 'dummy file' to signal that the model has
            ##  started running
            file.create(paste("Model Status/Resamples", tuneRsmp, "begun", sep="_"))
            ##Call train()
            mod <- train(Rpresent ~ .,
                         data=trainingData %>% 
                           filter(HowCoded=="Hand") %>% 
                           select(Rpresent, TokenDur:absSlopeF0),
                         method="ranger",
                         importance="impurity",
                         num.trees=500,
                         trControl=trainControl(method="boot", 
                                                number=25,
                                                sampling=if (tuneRsmp=="none") NULL else tuneRsmp,
                                                savePredictions=TRUE,
                                                summaryFunction=cutoffSummary,
                                                returnResamp="all",
                                                classProbs=TRUE),
                         tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1))
            ##Create an extensionless 'dummy file' to signal that the model has
            ##  finished running
            file.create(paste("Model Status/Resamples", tuneRsmp, "completed", sep="_"))
            ##Return the classifier
            return(mod)
          } %>% 
  ##Name each classifier with its hyperparameter setting
  set_names(tuneResamples)

##Stop the parallel cluster
stopCluster(cl)
resampEndTime <- proc.time()

Clear the model status dummy files:

# dir("Model Status", pattern="^Resamples", full.names=TRUE) %>% file.remove()

Create and save a summary object for this tuning step (note that we’ve written fstSummary() to handle a list of classifiers as well as a single classifier):

smryResamples <- fstListResamples %>% fstSummary()
smryResamples %>% saveRDS("Model Status/SmryResamples.Rds")

The resampled performance measures indicate that SMOTE sampling narrowly outperforms no resampling, which outperforms downsampling.

smryResamples$resample %>% 
  group_by(Model) %>% 
  summarise_if(is.numeric, mean, na.rm=TRUE) %>% 
  arrange(desc(AccAUC))

Because we tested each classifier using 25 bootstrap resamples, we can assess performance across a range of possible outcomes. The plot below shows the resampled overall accuracy \(\times\) AUC for each hyperparameter setting. The SMOTE and downsampling models are more variable in their performance than the non-resampled model; this is not surprising given that the process of resampling introduces an additional layer of randomness. Nevertheless, the SMOTE models generally outperform the downsampled and non-resampled models.

smryResamples$resample %>% 
  ggplot(aes(x=Model, y=AccAUC)) +
  geom_violin() +
  xlab("Additional resampling method") +
  ylab("Overall accuracy × AUC")

Even though we’re optimizing our classifiers based on overall accuracy \(\times\) AUC, our summary function allows us to keep an eye on other performance metrics we might care about, like class accuracies. The plot below demonstrates that the choice of additional resampling method makes a considerable impact on class accuracy. As we suspected, the non-resampled model suffers from stratification between high accuracy for Absent and low accuracy for Present; compared to the non-resampled model, the SMOTE model has slightly worse accuracy for Absent, but much better accuracy for Present. This is a tradeoff we’re willing to make, and this tradeoff is reflected in overall accuracy \(\times\) AUC, where the SMOTE model edges out the non-resampled model.

smryResamples$resample %>% 
  gather("Class", "ClassAccuracy", starts_with("ClassAccuracy_")) %>% 
  mutate_at(vars(Class), str_remove, "ClassAccuracy_") %>% 
  ggplot(aes(x=Model, y=ClassAccuracy, fill=Class)) +
  geom_violin() +
  xlab("Additional resampling method") +
  ylab("Class accuracy")

Again, caret automatically saves timing data in the random forest objects it creates, and the summary objects we’ve created store this timing information. Here, SMOTE takes substantially longer to run than no resampling or downsampling, thanks to the fact that it takes the extra step of creating synthetic minority class tokens. The time-saving benefits of parallel processing are also evident from the difference between user (the sum of running time on all cores) and elapsed (the actual time it took to run the models); it took 7 minutes to run the SMOTE model in parallel, but without parallel processing it would have taken nearly 25 minutes!

smryResamples$times
$none
$none$everything
   user  system elapsed 
354.512   2.235  67.174 

$none$final
   user  system elapsed 
 15.896   0.027   1.752 


$down
$down$everything
   user  system elapsed 
203.189   2.401  55.100 

$down$final
   user  system elapsed 
  8.273   0.068   1.834 


$smote
$smote$everything
    user   system  elapsed 
1141.469    6.938  395.779 

$smote$final
   user  system elapsed 
 50.734   0.195  13.869 

It’s also a good idea when you’re running these models to keep an eye on how much memory your forest lists are taking up by sitting in your R workspace. Here we see that our forest list with three different models takes up quite a bit of space. We’re running these forests on a platform where memory’s not a concern, but if it were, we’d certainly want to remove each forest list after extracting whatever summary information we want from the forest list and saving that summary to file.

fstListResamples %>% object.size() %>% format("Mb")
[1] "180.5 Mb"

Training/test subset generation method

The possible subset generation methods are the simple bootstrap estimator, the 0.632 bootstrap estimator, out-of-bag, k-fold cross-validation, repeated k-fold cross-validation, and leave-p-out cross-validation (aka leave-group-out cross-validation). Here, we use caret’s default parameter settings for each of these methods: 25 resamples for bootstrap, out-of-bag, and leave-p-out cross-validation; 10 folds for k-fold cross-validation and repeated k-fold cross-validation; and a training percentage of 75% for leave-p-out cross-validation. We’ll override one caret default, as we’ll use 5 repeats for repeated k-fold cross-validation (caret’s default of 1 repeat makes this method indistinguishable from non-repeated k-fold cross-validation).

Now, we generate our list of forests:

methodsStartTime <- proc.time()
tuneMethods <- c("boot", "boot632", "cv", "repeatedcv", "LGOCV")

cl <- makeCluster(tuneMethods %>% length())
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListMethods <-
  foreach(tuneMth = tuneMethods, 
          .packages=c("caret","dplyr")) %dopar% {
            file.create(paste("Model Status/Methods", tuneMth, "begun", sep="_"))
            mod <- train(Rpresent ~ .,
                         data=trainingData %>% 
                           filter(HowCoded=="Hand") %>% 
                           select(Rpresent, TokenDur:absSlopeF0),
                         method="ranger",
                         importance="impurity",
                         num.trees=500,
                         trControl=trainControl(method=tuneMth, 
                                                repeats=if (tuneMth=="repeatedcv") 5 else NA,
                                                sampling="smote",
                                                savePredictions=TRUE,
                                                summaryFunction=cutoffSummary,
                                                returnResamp="all",
                                                classProbs=TRUE),
                         tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1))
            file.create(paste("Model Status/Methods", tuneMth, "completed", sep="_"))
            
            return(mod)
          } %>% 
  set_names(tuneMethods)

stopCluster(cl)
methodsEndTime <- proc.time()

Clear the model status dummy files:

# dir("Model Status", pattern="^Methods", full.names=TRUE) %>% file.remove()

Create and save a summary object:

smryMethods <- fstListMethods %>% fstSummary()
smryMethods %>% saveRDS("Model Status/SmryMethods.Rds")

The resampling results give k-fold cross-validation as the best-performing model, although there’s not much difference between methods in terms of performance:

smryMethods$resample %>% 
  group_by(Model) %>% 
  summarise_if(is.numeric, mean, na.rm=TRUE) %>% 
  arrange(desc(AccAUC))

We can also plot the resampled results to get a better sense of the range of performance for each of these methods.

smryMethods$resample %>% 
  ggplot(aes(x=Model, y=AccAUC)) +
  geom_violin()

Training/test subset generation parameters

Now that we’ve selected repeated k-fold cross-validation as our subset generation method, what’s the optimal number of folds and the optimal number of repeats? For folds, We’ll sample a space of options around the default of 10: 4, 6, 8, 10, 12, 14. For repeats, we’ll sample a space of options around 5: 3, 5, 7.

foldsRepsStartTime <- proc.time()
tuneListFoldsReps <- expand.grid(Folds = seq(4, 14, by=2),
                                 Repeats = seq(3, 7, by=2))

tuneListFoldsReps <- tuneListFoldsReps %>% split(tuneListFoldsReps %>% nrow() %>% seq_len())
tuneListFoldsReps <- tuneListFoldsReps %>% set_names(tuneListFoldsReps %>% map(str_c, collapse=" "))

cl <- makeCluster(9)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListFoldsReps <-
  foreach(tuneNmRp = tuneListFoldsReps, 
          .packages=c("caret","dplyr")) %dopar% {
            file.create(paste("Model Status/FoldsReps", paste(tuneNmRp, collapse="&"), "begun", sep="_"))
            mod <- train(Rpresent ~ .,
                         data=trainingData %>% 
                           filter(HowCoded=="Hand") %>% 
                           select(Rpresent, TokenDur:absSlopeF0),
                         method="ranger",
                         importance="impurity",
                         num.trees=500,
                         trControl=trainControl(method="repeatedcv", 
                                                number=tuneNmRp$Folds,
                                                repeats=tuneNmRp$Repeats,
                                                sampling="smote",
                                                savePredictions=TRUE,
                                                summaryFunction=cutoffSummary,
                                                returnResamp="all",
                                                classProbs=TRUE),
                         tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1))
            file.create(paste("Model Status/FoldsReps", paste(tuneNmRp, collapse="&"), "completed", sep="_"))
            return(mod)
          } %>% 
  set_names(names(tuneListFoldsReps))

stopCluster(cl)
foldsRepsEndTime <- proc.time()

Clear the model status dummy files:

# dir("Model Status", pattern="^FoldsReps", full.names=TRUE) %>% file.remove()

Create another summary object:

smryFoldsReps <- fstListFoldsReps %>% fstSummary()
smryFoldsReps$resample <- smryFoldsReps$resample %>% 
  separate(Model, c("NumFolds", "Repeats"), sep=" ")
smryFoldsReps %>% saveRDS("Model Status/SmryFoldsReps.Rds")

The resampling results give the model with 12 folds as the best model:

smryFoldsReps$resample %>% 
  group_by(NumFolds, Repeats) %>% 
  summarise_if(is.numeric, mean, na.rm=TRUE) %>% 
  arrange(desc(AccAUC))

ranger parameters (mtry, splitrule, min.node.size)

Define an initial search space for these parameters. A typical value of mtry for classification is the square root of the number of predictors, rounded down to the nearest integer, which in the case of the /r/ data is 13; so we’ll try a few mtry values around 13: 9, 11, 13, 15, 17. splitrule has two values, "gini" and "extratrees", so we’ll try both. Finally, the original randomForest package (Liaw & Wiener, 2002) uses min.node.size defaults of 1 for classification and 5 for regression, so for good measure we’ll try 1, 5, and 10. With respect to the continuous parameters (mtry and min.node.size), the best classifiers in this initial round may indicate values toward the end of the ranges we tuned; if that’s the case, we’ll do another round of tuning these parameters with a better-informed range of values.

tuneListStartTime <- proc.time()
tuneList <- expand.grid(mtry=seq(9, 17, by=2), 
                        splitrule=c("gini", "extratrees"), 
                        min.node.size=c(1, 5, 10), 
                        stringsAsFactors=FALSE)
tuneList <- tuneList %>% split(tuneList %>% nrow() %>% seq_len())
tuneList <- tuneList %>% set_names(tuneList %>% map(str_c, collapse=" "))
cl <- makeCluster(10)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListTuneGrid <- 
  foreach(tuneVals = tuneList, 
          .packages=c("caret","dplyr")) %dopar% {
            file.create(paste("Model Status/TuneGrid", 
                              paste(tuneVals, collapse=" "), 
                              "begun", sep="_"))
            mod <- train(Rpresent ~ .,
                         data=trainingData %>% 
                           filter(HowCoded=="Hand") %>% 
                           select(Rpresent, TokenDur:absSlopeF0),
                         method="ranger",
                         importance="impurity",
                         num.trees=500,
                         trControl=trainControl(method="repeatedcv", 
                                                number=12,
                                                repeats=3,
                                                sampling="smote",
                                                savePredictions=TRUE,
                                                summaryFunction=cutoffSummary,
                                                returnResamp="all",
                                                classProbs=TRUE),
                         tuneGrid=tuneVals)
            file.create(paste("Model Status/TuneGrid", 
                              paste(tuneVals, collapse=" "), 
                              "completed", sep="_"))
            return(mod)
          } %>% 
  set_names(names(tuneList))
tuneListEndTime <- proc.time()

Clear the model status dummy files:

# dir("Model Status", pattern="^TuneGrid", full.names=TRUE) %>% file.remove()

To assess which tuning parameters are the best, we collect some of the most salient characteristics of the model (resampling results, confusion matrix, variable importance, and running time) into a summary object. We’ll also save this summary in its own .Rds file, in case something goes awry and we lose our R session (which is a risk when you’re trying to hold large objects in memory).

smryTuneGrid <- fstListTuneGrid %>% fstSummary()
smryTuneGrid %>% saveRDS("Model Status/SmryTuneGrid.Rds")

The resampled performance measures indicate the best performance with mtry=15, splitrule="gini", and min.node.size=1

smryTuneGrid$resample %>% 
  group_by(mtry, splitrule, min.node.size) %>% 
  summarise_if(is.numeric, mean, na.rm=TRUE) %>% 
  arrange(desc(AccAUC))

Number of trees

Finally, we can adjust the number of trees in our forest. We’ve been using 500 trees, but it seems reasonable to try smaller or larger forests. We’ll try forests of size 250, 500, 750, and 1000; if it seems that our forests’ performance only increases with greater size, we can try larger forests still.

treesStartTime <- proc.time()
tuneNumTrees <- seq(250, 1000, by=250)

cl <- makeCluster(tuneNumTrees %>% length())
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListNumTrees <-
  foreach(tuneNmTr = tuneNumTrees, 
          .packages=c("caret","dplyr")) %dopar% {
            file.create(paste("Model Status/NumTrees", tuneNmTr, "begun", sep="_"))
            mod <- train(Rpresent ~ .,
                         data=trainingData %>% 
                           filter(HowCoded=="Hand") %>% 
                           select(Rpresent, TokenDur:absSlopeF0),
                         method="ranger",
                         importance="impurity",
                         num.trees=tuneNmTr,
                         trControl=trainControl(method="repeatedcv",
                                                number=12,
                                                repeats=3, 
                                                sampling="smote",
                                                savePredictions=TRUE,
                                                summaryFunction=cutoffSummary,
                                                returnResamp="all",
                                                classProbs=TRUE),
                         tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
            file.create(paste("Model Status/NumTrees", tuneNmTr, "completed", sep="_"))
            return(mod)
          } %>% 
  set_names(tuneNumTrees)

stopCluster(cl)
treesEndTime <- proc.time()

Clear the model status dummy files:

# dir("Model Status", pattern="^NumTrees", full.names=TRUE) %>% file.remove()

Create and save another summary object:

smryNumTrees <- fstListNumTrees %>% fstSummary()
smryNumTrees %>% saveRDS("Model Status/SmryNumTrees.Rds")

The resampled performance measures are best for the model with 750 trees, so we’ll keep that setting, though we acknowledge that there is no real difference in performance between 250, 500, 750, and 1000 trees (that is, if our model took a very long time to run, we could get away with using just 250 trees).

smryNumTrees$resample %>% 
  group_by(Model) %>% 
  summarise_if(is.numeric, mean, na.rm=TRUE) %>% 
  arrange(desc(AccAUC))

Summary: Hyperparameter tuning

After the tuning process, our classifier has the following hyperparameter settings:

  • Additional resampling: SMOTE

  • Training/test subset generation method: k-fold cross-validation

  • Training/test subset generation parameters: 10 folds

  • ranger parameters: mtry=13, splitrule="gini", min.node.size=10

  • Number of trees: 750

Let’s train another classifier with these parameter settings to get a snapshot of performance after tuning hyperparameters (since we’re only training one model, we’ll dispatch with the parallel backend):

tunedStartTime <- proc.time()
registerDoSEQ()
set.seed(302302)
fstTuned <- train(Rpresent ~ .,
                  data=trainingData %>%
                    filter(HowCoded=="Hand") %>%
                    select(Rpresent, TokenDur:absSlopeF0),
                  method="ranger",
                  importance="impurity",
                  num.trees=750,
                  trControl=trainControl(method="repeatedcv", 
                                         number=12,
                                         repeats=3,
                                         sampling="smote",
                                         savePredictions=TRUE,
                                         summaryFunction=cutoffSummary,
                                         returnResamp="all",
                                         classProbs=TRUE),
                  tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
tunedEndTime <- proc.time()

Create and save a summary object:

smryTuned <- fstTuned %>% fstSummary()
smryTuned %>% saveRDS("Model Status/SmryTuned.Rds")

Here are the performance measures for our tuned classifier:

smryTuned$resample %>% 
  summarise_if(is.numeric, mean, na.rm=TRUE)

Finally, we remove forest objects to avoid overtaxing memory (depending on your system, it might be a good idea to do this after each hyperparameter), except for fstTuned.

rm(fstListMethods, fstListFoldsReps, fstListNumTrees, fstListResamples, fstListTuneGrid)

Step 5: Determine effect of outliers

So far, we’ve been training our classifiers on the entire set of hand-coded data available to us, without excluding any outliers, but it’s worth questioning whether this is the right approach. One the one hand, bigger is better from the perspective of both training and testing: our sample size simulations demonstrate the positive effect of training set size on classifier performance, and the more permissive we are about outliers, the fewer test set tokens we throw out as ‘uncodeable.’ On the other hand, the principle of ‘garbage in, garbage out’ warns us against trusting the predictions of a model trained on bad data.

We already know how our model performs when we’re maximally permissive about outliers. Let’s train another model where we’re maximally conservative about outliers, discarding every token with even a single outlying measurement:

noOutStartTime <- proc.time()
registerDoSEQ()
set.seed(302302)
fstNoOutlier <- train(Rpresent ~ .,
                      data=trainingData %>% 
                        filter(HowCoded=="Hand") %>% 
                        filter_at(vars(ends_with("_Outlier")), all_vars(!.)) %>%
                        select(Rpresent, TokenDur:absSlopeF0),
                      method="ranger",
                      importance="impurity",
                      num.trees=750,
                      trControl=trainControl(method="repeatedcv", 
                                             number=12,
                                             repeats=3,
                                             sampling="smote",
                                             savePredictions=TRUE,
                                             summaryFunction=cutoffSummary,
                                             returnResamp="all",
                                             classProbs=TRUE),
                      tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))

As the performance measures below show, this no-outlier classifier improves considerably over our previous model.

noOutEndTime <- proc.time()
fstNoOutlier$resample %>% 
  summarise_if(is.numeric, mean, na.rm=TRUE) %>% 
  select(AccAUC, Accuracy, AUC, everything())
rm(fstNoOutlier)

But what is the cost for this improved performance? If we only keep tokens with zero outlying measurements, we throw out about 40% of our tokens for both the hand-coded training set and the auto-coded test set. This is because we have a lot of formant timepoint measures, so the likelihood of outliers is high even if the likelihood of meaningful outliers is low. Indeed, only 5% of all formant timepoint measurements are outliers (a direct consequence of our 2-SD cutoff); among tokens with at least one outlying measurement, half have fewer than 4 outliers.

trainingData %>% 
  mutate(PossOutlier = apply(select(., ends_with("_Outlier")), 1, any)) %$%
  table(HowCoded, PossOutlier)
        PossOutlier
HowCoded FALSE  TRUE
    Auto 17204 15915
    Hand  2978  2642
trainingData %>% 
  select(ends_with("_Outlier")) %>%
  summarise_all(mean) %>% 
  apply(1, mean)
[1] 0.04197241
trainingData %>% 
  mutate(NumOutlier = apply(select(., ends_with("_Outlier")), 1, sum)) %>%
  filter(NumOutlier > 0) %>% 
  ggplot(aes(x=NumOutlier)) +
  geom_histogram(binwidth=1) +
  facet_grid(HowCoded ~ .)

In other words, where outliers are concerned, we have to achieve a balance. It’s likely that some outliers hurt our classifier’s performance so much that our classifier is better without them, and other outliers hurt performance to such a minor degree that it’s worth keeping those tokens. In addition, with so many acoustic measures, the likelihood of outliers is high, so we want to be careful about which outliers we can afford to keep and which ones are detrimental to performance.

Here is the procedure we’ll follow. We’ll train a set of 10 classifiers (to lessen statistical noise) and obtain a ranking of which measures’ outliers most hurt classifier performance. We’ll train another set of 10 classifiers where we drop outliers in the top-ranked measure, and we’ll compare performance in the classifiers that drop outliers to the ones that don’t drop outliers. If there is a significant difference in performance, we will consider the loss in data to be worth the gain in performance; we’ll obtain an updated ranking of measures from this set of classifiers, and we’ll proceed to train another set of classifiers that drops outliers in the top-ranked measure. If there is no significant difference in performance, we’ll move down to the next measure in the ranking to see if dropping its outliers leads to a significant gain in performance. Once we reach three consecutive nonsignificant results, we’ll stop. By way of this procedure, we’ll determine which outliers are most detrimental to performance and which ones we can afford to keep.

Find ranking of outliers

To lessen the effect of statistical noise, we’ll train 5 models for each step. We’ll use parallel processing as before, but with one additional bit of machinery. Previously, we ensured reproducibility by using parallel::clusterSetRNGStream() to pass the same random seed to each parallel process. If we did the same thing here, we would end up with 5 repetitions of the exact same model running in parallel! Not exactly the best use of parallel processing. As a result, we’ll also pass different vectors of random seeds to the seeds argument of trainControl(), ensuring that train() generates different training and test subsets from one parallel model to the next.

registerDoSEQ()
set.seed(302302)
##For repeated k-fold CV with n repeats, trainControl(seeds) must be a list of
##  (n*k)+1 lists, each containing M integers (M = number of models being 
##  evaluated within each call to train())
seedList <- replicate(5,
                      replicate(37, sample.int(10000, 1), simplify=FALSE),
                      simplify=FALSE)

First, train a list of 5 classifiers on the full data set (i.e., without dropping any outliers):

drop0StartTime <- proc.time()
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListDropOut0 <- foreach(seed = seedList, 
                           .packages=c("caret","dplyr")) %dopar% {
                             train(Rpresent ~ .,
                                   data=trainingData %>% 
                                     filter(HowCoded=="Hand") %>% 
                                     select(Rpresent, TokenDur:absSlopeF0),
                                   method="ranger",
                                   importance="impurity",
                                   num.trees=750,
                                   trControl=trainControl(method="repeatedcv", 
                                                          number=12,
                                                          repeats=3,
                                                          sampling="smote",
                                                          seeds=seed,
                                                          savePredictions=TRUE,
                                                          summaryFunction=cutoffSummary,
                                                          returnResamp="all",
                                                          classProbs=TRUE),
                                   tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
                           }

stopCluster(cl)
drop0EndTime <- proc.time()

Extract and save a summary:

smryDropOut0 <- fstListDropOut0 %>% fstSummary()
smryDropOut0 %>% saveRDS("Model Status/SmryDropOut0.Rds")

For each forest in this list, we can assess the effect of each outlier on performance by comparing the proportions of outliers for different measures in each fold and each fold’s performance. We pull each token’s fold assignment from the pred element of each forest and add it to each token’s outlier data:

fstListDropOut0[[1]] %>% 
  extract2("pred") %>% 
  arrange(rowIndex) %>% 
  select(Resample) %>% 
  cbind(trainingData %>% filter(HowCoded=="Hand") %>% select(ends_with("_Outlier")))

From this data, we can find the proportion of outliers for each measure in each fold (within each repetition):

fstListDropOut0[[1]] %>% 
  extract2("pred") %>% 
  arrange(rowIndex) %>% 
  select(Resample) %>% 
  cbind(trainingData %>% filter(HowCoded=="Hand") %>% select(ends_with("_Outlier"))) %>% 
  group_by(Resample) %>% 
  summarise_all(mean)

We then take performance data from the resample element of the forest and join it to the outlier data:

fstListDropOut0[[1]] %>% 
  extract2("resample") %>% 
  select(AccAUC, Resample)
fstListDropOut0[[1]] %>% 
  extract2("pred") %>% 
  arrange(rowIndex) %>% 
  select(Resample) %>% 
  cbind(trainingData %>% filter(HowCoded=="Hand") %>% select(ends_with("_Outlier"))) %>% 
  group_by(Resample) %>% 
  summarise_all(mean) %>% 
  left_join(fstListDropOut0[[1]] %>% 
              extract2("resample") %>% 
              select(AccAUC, Resample),
            by="Resample")

After we put this data together for all 5 classifiers in the list, we pass the resulting dataframe to cor() to get correlations, and we pull out just the correlations we care about: the correlations between AccAUC and the proportions of outliers. Finally, we store the data in a dataframe and, to save memory, remove the forest list from our R environment. The following function puts all these calculations together:

##Find the outliers that most hurt classifier performance by calculating the
##  correlation between folds' test performance and proportion of outliers
##This code works for repeated k-fold cross-validation, but it'll need to be
##  tweaked for different methods of generating training/test subsets.
outlierCors <- function(fstList, data=trainingData, dropped=NULL) {
  ##Get fold info (performance and proportion of outliers) for each forest; if
  ##  outliers have already been dropped, don't include them in the calculation
  if (is.null(dropped) | length(dropped)==0) {
    foldInfo <- function(fst, data, dropped) {
      data %>% 
        filter(HowCoded=="Hand") %>% 
        cbind(fst$pred %>% arrange(rowIndex) %>% select(Resample)) %>%
        group_by(Resample) %>%
        summarise_at(vars(ends_with("_Outlier")), mean, na.rm=TRUE) %>%
        left_join(fst$resample %>% select(AccAUC, Resample), by="Resample")
    } 
  } else {
    foldInfo <- function(fst, data, dropped) {
      data %>% 
        filter(HowCoded=="Hand") %>% 
        filter_at(vars(dropped), all_vars(!.)) %>% 
        cbind(fst$pred %>% arrange(rowIndex) %>% select(Resample)) %>%
        group_by(Resample) %>%
        summarise_at(vars(ends_with("_Outlier")), mean, na.rm=TRUE) %>%
        select(-dropped) %>% 
        left_join(fst$resample %>% select(AccAUC, Resample), by="Resample")
    }
  }
  
  ##Get fold info for all forests and put them together into a dataframe
  ret <-
    fstList %>% 
    map_dfr(foldInfo, data=data, dropped=dropped) %>%
    select(-Resample) %>%
    select(AccAUC, everything()) %>%
    ##Get correlations between AccAUC and outlier proportions
    cor() %>%
    ##Reduce correlation matrix to a vector
    extract(1, -1) %>%
    sort() %>%
    tibble(Measure = names(.), TestCor = .)
  
  ret
}

Running this function on our forest list with no outliers dropped, we find an initial ranking of outliers with F3_80 as the outlier that most hurts classifier performance. We’ll try removing F3_80 outliers first.

(negOutliers <- fstListDropOut0 %>% outlierCors(dropped=NULL))
rm(fstListDropOut0)

Drop outliers in one measure at a time

Start by initializing a character vector (empty, for now) where we’ll store measures that we decide to drop:

dropped <- character(0L)

Now train another list of 5 classifiers, this time dropping tokens with outliers in the measure with the biggest negative impact on performance, F3_80:

drop11StartTime <- proc.time()
tryDrop <- negOutliers$Measure[1]

cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListDropOut1_1 <- foreach(seed = seedList, 
                             .packages=c("caret","dplyr")) %dopar% {
                               train(Rpresent ~ .,
                                     data=trainingData %>% 
                                       filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
                                       filter(HowCoded=="Hand") %>% 
                                       select(Rpresent, TokenDur:absSlopeF0),
                                     method="ranger",
                                     importance="impurity",
                                     num.trees=750,
                                     trControl=trainControl(method="repeatedcv", 
                                                            number=12,
                                                            repeats=3,
                                                            sampling="smote",
                                                            seeds=seed,
                                                            savePredictions=TRUE,
                                                            summaryFunction=cutoffSummary,
                                                            returnResamp="all",
                                                            classProbs=TRUE),
                                     tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
                             }

stopCluster(cl)
drop11EndTime <- proc.time()

Create and save a summary object:

smryDropOut1_1 <- fstListDropOut1_1 %>% fstSummary()
smryDropOut1_1 %>% saveRDS("Model Status/SmryDropOut1_1.Rds")

Now we can compare the overall accuracy \(\times\) AUC for the 10 classifiers that drop no outliers vs. the overall accuracy \(\times\) AUC for the 10 classifiers that drop tokens with F3_80 outliers. The latter model well outperforms the former, indicating that the performance gain from dropping F3_80 outliers is worth the loss in data.

list(smryDropOut0, smryDropOut1_1) %>% 
  set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>% 
  ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
  geom_density(alpha=0.5)

Now, we’ll proceed to drop outliers for one measure at a time and compare classifier performance after each step using Wilcoxon rank-sum tests (rather than its parametric counterpart, an independent samples t-test). The distributions of classifier performance fail to satisfy the assumption of normality that is necessary to use an independent samples t-test, so instead we compare performance using a Wilcoxon rank-sum test. We’ll drop outliers until we get three consecutive results showing that classifier performance fails to significantly improve at \(\alpha\) = .05. In this case, we find a highly significant increase in classifier performance, which justifies dropping tokens with outlying F3_80 measurements.

list(smryDropOut0, smryDropOut1_1) %>% 
  set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>% 
  wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)

    Wilcoxon rank sum exact test

data:  AccAUC by Dropped
W = 46, p-value = 0.00247
alternative hypothesis: true location shift is less than 0

So we add F3_80 to our established set of dropped outliers, meaning we drop about 3% of tokens:

dropped <- c(dropped, tryDrop)
tibble(Dropped = "None", 
       Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
       Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>% 
  rbind(
    cbind(Dropped = dropped,
          map_dfr(dropped %>% seq_along(),
                  ~ trainingData %>%
                    filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
                    count(HowCoded) %>%
                    spread(HowCoded, n)
          )
    )
  ) %>%
  mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))

We re-rank measures based on the results of this list of classifiers:

(negOutliers <- fstListDropOut1_1 %>% outlierCors(dropped=dropped))
Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(dropped)` instead of `dropped` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.

And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F3_80 and intens_F3max:

drop21StartTime <- proc.time()
rm(fstListDropOut1_1)

tryDrop <- negOutliers$Measure[1]

cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListDropOut2_1 <- foreach(seed = seedList, 
                             .packages=c("caret","dplyr")) %dopar% {
                               train(Rpresent ~ .,
                                     data=trainingData %>% 
                                       filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>% 
                                       filter(HowCoded=="Hand") %>% 
                                       select(Rpresent, TokenDur:absSlopeF0),
                                     method="ranger",
                                     importance="impurity",
                                     num.trees=750,
                                     trControl=trainControl(method="repeatedcv", 
                                                            number=12,
                                                            repeats=3,
                                                            sampling="smote",
                                                            seeds=seed,
                                                            savePredictions=TRUE,
                                                            summaryFunction=cutoffSummary,
                                                            returnResamp="all",
                                                            classProbs=TRUE),
                                     tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
                             }

stopCluster(cl)
drop21EndTime <- proc.time()

Create and save a summary object:

smryDropOut2_1 <- fstListDropOut2_1 %>% fstSummary()
smryDropOut2_1 %>% saveRDS("Model Status/SmryDropOut2_1.Rds")

The plot of performance suggests little performance gain from dropping intens_F3max outliers, compared to our established set of dropped outliers.

list(smryDropOut0, smryDropOut1_1, smryDropOut2_1) %>%
  set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>% 
  ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
  geom_density(alpha=0.5)

A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping intens_F3max outliers, compared to our established set of dropped outliers.

list(smryDropOut1_1, smryDropOut2_1) %>%
  set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, 
            levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>% 
  wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)

    Wilcoxon rank sum exact test

data:  AccAUC by Dropped
W = 0, p-value = 6.447e-09
alternative hypothesis: true location shift is less than 0

So we move on to the next measure in the ranking, F2_35, training a list of classifiers on data that drops outliers in F2_35 in addition to our established set of dropped outliers.

drop22StartTime <- proc.time()
rm(fstListDropOut2_1)

tryDrop <- negOutliers$Measure[2]

cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListDropOut2_2 <- foreach(seed = seedList, 
                             .packages=c("caret","dplyr")) %dopar% {
                               train(Rpresent ~ .,
                                     data=trainingData %>% 
                                       filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>% 
                                       filter(HowCoded=="Hand") %>% 
                                       select(Rpresent, TokenDur:absSlopeF0),
                                     method="ranger",
                                     importance="impurity",
                                     num.trees=750,
                                     trControl=trainControl(method="repeatedcv", 
                                                            number=12,
                                                            repeats=3,
                                                            sampling="smote",
                                                            seeds=seed,
                                                            savePredictions=TRUE,
                                                            summaryFunction=cutoffSummary,
                                                            returnResamp="all",
                                                            classProbs=TRUE),
                                     tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
                             }

stopCluster(cl)
drop22EndTime <- proc.time()

Create and save a summary object:

smryDropOut2_2 <- fstListDropOut2_2 %>% fstSummary()
smryDropOut2_2 %>% saveRDS("Model Status/SmryDropOut2_2.Rds")

The plot of performance suggests further performance gain from dropping F2_35 outliers, compared to our established set of dropped outliers.

list(smryDropOut0, smryDropOut1_1, smryDropOut2_2) %>%
  set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>% 
  ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
  geom_density(alpha=0.5)

A Wilcoxon rank-sum test indicates a significant performance gain from dropping F2_35 outliers, compared to our established set of dropped outliers.

list(smryDropOut1_1, smryDropOut2_2) %>%
  set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, 
            levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>% 
  wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)

    Wilcoxon rank sum exact test

data:  AccAUC by Dropped
W = 10, p-value = 8.961e-07
alternative hypothesis: true location shift is less than 0

So we add F2_35 to our established set of dropped outliers, meaning we drop about 8% of tokens:

dropped <- c(dropped, tryDrop)
tibble(Dropped = "None", 
       Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
       Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>% 
  rbind(
    cbind(Dropped = dropped,
          map_dfr(dropped %>% seq_along(),
                  ~ trainingData %>%
                    filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
                    count(HowCoded) %>%
                    spread(HowCoded, n)
          )
    )
  ) %>%
  mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))

We re-rank measures based on the results of this list of classifiers:

(negOutliers <- fstListDropOut2_2 %>% outlierCors(dropped=dropped))

And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in intens_F3min, in addition to our established set of dropped outliers:

drop31StartTime <- proc.time()
rm(fstListDropOut2_2)

tryDrop <- negOutliers$Measure[1]

cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListDropOut3_1 <- foreach(seed = seedList, 
                             .packages=c("caret","dplyr")) %dopar% {
                               train(Rpresent ~ .,
                                     data=trainingData %>% 
                                       filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>% 
                                       filter(HowCoded=="Hand") %>% 
                                       select(Rpresent, TokenDur:absSlopeF0),
                                     method="ranger",
                                     importance="impurity",
                                     num.trees=750,
                                     trControl=trainControl(method="repeatedcv", 
                                                            number=12,
                                                            repeats=3,
                                                            sampling="smote",
                                                            seeds=seed,
                                                            savePredictions=TRUE,
                                                            summaryFunction=cutoffSummary,
                                                            returnResamp="all",
                                                            classProbs=TRUE),
                                     tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
                             }

stopCluster(cl)
drop31EndTime <- proc.time()

Create and save a summary object:

smryDropOut3_1 <- fstListDropOut3_1 %>% fstSummary()
smryDropOut3_1 %>% saveRDS("Model Status/SmryDropOut3_1.Rds")

The plot of performance suggests further performance gain from dropping intens_F3min outliers, compared to our established set of dropped outliers.

list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1) %>%
  set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>% 
  ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
  geom_density(alpha=0.5)

A Wilcoxon rank-sum test indicates a significant performance gain from dropping intens_F3min outliers, compared to our established set of dropped outliers.

list(smryDropOut2_2, smryDropOut3_1) %>%
  set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, 
            levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>% 
  wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)

    Wilcoxon rank sum exact test

data:  AccAUC by Dropped
W = 92, p-value = 0.2062
alternative hypothesis: true location shift is less than 0

So we add intens_F3min to our established set of dropped outliers, meaning we drop about 9.5% of tokens:

dropped <- c(dropped, tryDrop)
tibble(Dropped = "None", 
       Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
       Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>% 
  rbind(
    cbind(Dropped = dropped,
          map_dfr(dropped %>% seq_along(),
                  ~ trainingData %>%
                    filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
                    count(HowCoded) %>%
                    spread(HowCoded, n)
          )
    )
  ) %>%
  mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))

We re-rank measures based on the results of this list of classifiers:

(negOutliers <- fstListDropOut3_1 %>% outlierCors(dropped=dropped))

And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F1_70, in addition to our established set of dropped outliers:

drop41StartTime <- proc.time()
rm(fstListDropOut3_1)

tryDrop <- negOutliers$Measure[1]

cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListDropOut4_1 <- foreach(seed = seedList, 
                             .packages=c("caret","dplyr")) %dopar% {
                               train(Rpresent ~ .,
                                     data=trainingData %>% 
                                       filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>% 
                                       filter(HowCoded=="Hand") %>% 
                                       select(Rpresent, TokenDur:absSlopeF0),
                                     method="ranger",
                                     importance="impurity",
                                     num.trees=750,
                                     trControl=trainControl(method="repeatedcv", 
                                                            number=12,
                                                            repeats=3,
                                                            sampling="smote",
                                                            seeds=seed,
                                                            savePredictions=TRUE,
                                                            summaryFunction=cutoffSummary,
                                                            returnResamp="all",
                                                            classProbs=TRUE),
                                     tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
                             }

stopCluster(cl)
drop41EndTime <- proc.time()

Create and save a summary object:

smryDropOut4_1 <- fstListDropOut4_1 %>% fstSummary()
smryDropOut4_1 %>% saveRDS("Model Status/SmryDropOut4_1.Rds")

The plot of performance suggests further performance gain from dropping F1_70 outliers, compared to our established set of dropped outliers.

list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1, 
     smryDropOut4_1) %>%
  set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>% 
  ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
  geom_density(alpha=0.5)

A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping intens_F3max outliers, compared to our established set of dropped outliers.

list(smryDropOut3_1, smryDropOut4_1) %>%
  set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, 
            levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>% 
  wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)

    Wilcoxon rank sum exact test

data:  AccAUC by Dropped
W = 31, p-value = 0.000197
alternative hypothesis: true location shift is less than 0

So we add F1_70 to our established set of dropped outliers, meaning we drop about 13% of tokens:

dropped <- c(dropped, tryDrop)
tibble(Dropped = "None", 
       Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
       Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>% 
  rbind(
    cbind(Dropped = dropped,
          map_dfr(dropped %>% seq_along(),
                  ~ trainingData %>%
                    filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
                    count(HowCoded) %>%
                    spread(HowCoded, n)
          )
    )
  ) %>%
  mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))

We re-rank measures based on the results of this list of classifiers:

(negOutliers <- fstListDropOut4_1 %>% outlierCors(dropped=dropped))

And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F1_50, in addition to our established set of dropped outliers:

drop51StartTime <- proc.time()
rm(fstListDropOut4_1)

tryDrop <- negOutliers$Measure[1]

cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListDropOut5_1 <- foreach(seed = seedList, 
                             .packages=c("caret","dplyr")) %dopar% {
                               train(Rpresent ~ .,
                                     data=trainingData %>% 
                                       filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>% 
                                       filter(HowCoded=="Hand") %>% 
                                       select(Rpresent, TokenDur:absSlopeF0),
                                     method="ranger",
                                     importance="impurity",
                                     num.trees=750,
                                     trControl=trainControl(method="repeatedcv", 
                                                            number=12,
                                                            repeats=3,
                                                            sampling="smote",
                                                            seeds=seed,
                                                            savePredictions=TRUE,
                                                            summaryFunction=cutoffSummary,
                                                            returnResamp="all",
                                                            classProbs=TRUE),
                                     tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
                             }

stopCluster(cl)
drop51EndTime <- proc.time()

Create and save a summary object:

smryDropOut5_1 <- fstListDropOut5_1 %>% fstSummary()
smryDropOut5_1 %>% saveRDS("Model Status/SmryDropOut5_1.Rds")

The plot of performance suggests some further performance gain from dropping F1_50 outliers, compared to our established set of dropped outliers.

list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1, 
     smryDropOut4_1, smryDropOut5_1) %>%
  set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>% 
  ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
  geom_density(alpha=0.5)

A Wilcoxon rank-sum test indicates a significant performance gain from dropping F1_50 outliers, compared to our established set of dropped outliers.

list(smryDropOut4_1, smryDropOut5_1) %>%
  set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, 
            levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>% 
  wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)

    Wilcoxon rank sum exact test

data:  AccAUC by Dropped
W = 61, p-value = 0.01647
alternative hypothesis: true location shift is less than 0

So we add F1_50 to our established set of dropped outliers, meaning we drop about 15% of tokens:

dropped <- c(dropped, tryDrop)
tibble(Dropped = "None", 
       Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
       Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>% 
  rbind(
    cbind(Dropped = dropped,
          map_dfr(dropped %>% seq_along(),
                  ~ trainingData %>%
                    filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
                    count(HowCoded) %>%
                    spread(HowCoded, n)
          )
    )
  ) %>%
  mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))

We re-rank measures based on the results of this list of classifiers:

(negOutliers <- fstListDropOut5_1 %>% outlierCors(dropped=dropped))

And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F3min, in addition to our established set of dropped outliers:

drop61StartTime <- proc.time()
rm(fstListDropOut5_1)

tryDrop <- negOutliers$Measure[1]

cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListDropOut6_1 <- foreach(seed = seedList, 
                             .packages=c("caret","dplyr")) %dopar% {
                               train(Rpresent ~ .,
                                     data=trainingData %>% 
                                       filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>% 
                                       filter(HowCoded=="Hand") %>% 
                                       select(Rpresent, TokenDur:absSlopeF0),
                                     method="ranger",
                                     importance="impurity",
                                     num.trees=750,
                                     trControl=trainControl(method="repeatedcv", 
                                                            number=12,
                                                            repeats=3,
                                                            sampling="smote",
                                                            seeds=seed,
                                                            savePredictions=TRUE,
                                                            summaryFunction=cutoffSummary,
                                                            returnResamp="all",
                                                            classProbs=TRUE),
                                     tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
                             }

stopCluster(cl)
drop61EndTime <- proc.time()

Create and save a summary object:

smryDropOut6_1 <- fstListDropOut6_1 %>% fstSummary()
smryDropOut6_1 %>% saveRDS("Model Status/SmryDropOut6_1.Rds")

The plot of performance suggests some further performance gain from dropping F3min outliers, compared to our established set of dropped outliers.

list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1, 
     smryDropOut4_1, smryDropOut5_1, smryDropOut6_1) %>%
  set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>% 
  ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
  geom_density(alpha=0.5)

A Wilcoxon rank-sum test indicates a significant performance gain from dropping F3min outliers, compared to our established set of dropped outliers.

list(smryDropOut5_1, smryDropOut6_1) %>%
  set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, 
            levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>% 
  wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)

    Wilcoxon rank sum exact test

data:  AccAUC by Dropped
W = 6, p-value = 1.934e-07
alternative hypothesis: true location shift is less than 0

So we add F3min to our established set of dropped outliers, meaning we drop about 17% of tokens:

dropped <- c(dropped, tryDrop)
tibble(Dropped = "None", 
       Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
       Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>% 
  rbind(
    cbind(Dropped = dropped,
          map_dfr(dropped %>% seq_along(),
                  ~ trainingData %>%
                    filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
                    count(HowCoded) %>%
                    spread(HowCoded, n)
          )
    )
  ) %>%
  mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))

We re-rank measures based on the results of this list of classifiers:

(negOutliers <- fstListDropOut6_1 %>% outlierCors(dropped=dropped))

And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F4_40, in addition to our established set of dropped outliers:

drop71StartTime <- proc.time()
rm(fstListDropOut6_1)

tryDrop <- negOutliers$Measure[1]

cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListDropOut7_1 <- foreach(seed = seedList, 
                             .packages=c("caret","dplyr")) %dopar% {
                               train(Rpresent ~ .,
                                     data=trainingData %>% 
                                       filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>% 
                                       filter(HowCoded=="Hand") %>% 
                                       select(Rpresent, TokenDur:absSlopeF0),
                                     method="ranger",
                                     importance="impurity",
                                     num.trees=750,
                                     trControl=trainControl(method="repeatedcv", 
                                                            number=12,
                                                            repeats=3,
                                                            sampling="smote",
                                                            seeds=seed,
                                                            savePredictions=TRUE,
                                                            summaryFunction=cutoffSummary,
                                                            returnResamp="all",
                                                            classProbs=TRUE),
                                     tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
                             }

stopCluster(cl)
drop71EndTime <- proc.time()

Create and save a summary object:

smryDropOut7_1 <- fstListDropOut7_1 %>% fstSummary()
smryDropOut7_1 %>% saveRDS("Model Status/SmryDropOut7_1.Rds")

The plot of performance suggests little performance gain from dropping F4_40 outliers, compared to our established set of dropped outliers.

list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1, 
     smryDropOut4_1, smryDropOut5_1, smryDropOut6_1, smryDropOut7_1) %>%
  set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>% 
  ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
  geom_density(alpha=0.5)

A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping F4_40 outliers, compared to our established set of dropped outliers.

list(smryDropOut6_1, smryDropOut7_1) %>%
  set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, 
            levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>% 
  wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)

    Wilcoxon rank sum exact test

data:  AccAUC by Dropped
W = 62, p-value = 0.01836
alternative hypothesis: true location shift is less than 0

So we move on to the next measure in the ranking, F4_50, training a list of classifiers on data that drops outliers in F4_50 in addition to our established set of dropped outliers.

drop72StartTime <- proc.time()
rm(fstListDropOut7_1)

tryDrop <- negOutliers$Measure[2]

cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListDropOut7_2 <- foreach(seed = seedList, 
                             .packages=c("caret","dplyr")) %dopar% {
                               train(Rpresent ~ .,
                                     data=trainingData %>% 
                                       filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>% 
                                       filter(HowCoded=="Hand") %>% 
                                       select(Rpresent, TokenDur:absSlopeF0),
                                     method="ranger",
                                     importance="impurity",
                                     num.trees=750,
                                     trControl=trainControl(method="repeatedcv", 
                                                            number=12,
                                                            repeats=3,
                                                            sampling="smote",
                                                            seeds=seed,
                                                            savePredictions=TRUE,
                                                            summaryFunction=cutoffSummary,
                                                            returnResamp="all",
                                                            classProbs=TRUE),
                                     tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
                             }

stopCluster(cl)
drop72EndTime <- proc.time()

Create and save a summary object:

smryDropOut7_2 <- fstListDropOut7_2 %>% fstSummary()
smryDropOut7_2 %>% saveRDS("Model Status/SmryDropOut7_2.Rds")

The plot of performance suggests little performance gain from dropping F4_50 outliers, compared to our established set of dropped outliers.

list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1, 
     smryDropOut4_1, smryDropOut5_1, smryDropOut6_1, smryDropOut7_2) %>%
  set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>% 
  ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
  geom_density(alpha=0.5)

A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping F4_50 outliers, compared to our established set of dropped outliers.

list(smryDropOut6_1, smryDropOut7_2) %>%
  set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, 
            levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>% 
  wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)

    Wilcoxon rank sum exact test

data:  AccAUC by Dropped
W = 16, p-value = 5.886e-06
alternative hypothesis: true location shift is less than 0

So we move on to the next measure in the ranking, F1_60, training a list of classifiers on data that drops outliers in F1_60 in addition to our established set of dropped outliers. Since that’s been two consecutive nonsignificant results, we’ll stop this process if the next result is also nonsignificant.

drop73StartTime <- proc.time()
rm(fstListDropOut7_2)

tryDrop <- negOutliers$Measure[3]

cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)

fstListDropOut7_3 <- foreach(seed = seedList, 
                             .packages=c("caret","dplyr")) %dopar% {
                               train(Rpresent ~ .,
                                     data=trainingData %>% 
                                       filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>% 
                                       filter(HowCoded=="Hand") %>% 
                                       select(Rpresent, TokenDur:absSlopeF0),
                                     method="ranger",
                                     importance="impurity",
                                     num.trees=750,
                                     trControl=trainControl(method="repeatedcv", 
                                                            number=12,
                                                            repeats=3,
                                                            sampling="smote",
                                                            seeds=seed,
                                                            savePredictions=TRUE,
                                                            summaryFunction=cutoffSummary,
                                                            returnResamp="all",
                                                            classProbs=TRUE),
                                     tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
                             }

stopCluster(cl)
drop73EndTime <- proc.time()

Create and save a summary object:

smryDropOut7_3 <- fstListDropOut7_3 %>% fstSummary()
smryDropOut7_3 %>% saveRDS("Model Status/SmryDropOut7_3.Rds")

The plot of performance suggests little performance gain from dropping F1_60 outliers, compared to our established set of dropped outliers.

list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1, 
     smryDropOut4_1, smryDropOut5_1, smryDropOut6_1, smryDropOut7_3) %>%
  set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>% 
  ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
  geom_density(alpha=0.5)

A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping F1_60 outliers, compared to our established set of dropped outliers.

list(smryDropOut6_1, smryDropOut7_3) %>%
  set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, 
            levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>% 
  wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)

    Wilcoxon rank sum exact test

data:  AccAUC by Dropped
W = 75, p-value = 0.06307
alternative hypothesis: true location shift is less than 0

Since that’s been three consecutive nonsignificant results, we stop our outlier-dropping process there.

rm(fstListDropOut7_3)

Summary: Outlier dropping

We’ve dropped outliers in six measures for which outliers significantly hurt classifier performance: F3_80, F2_35, intens_F3min, F1_70, F1_50, and F3min. We determined which outliers to drop not in terms of variable importance—most of these measures did not rank particularly high in variable importance—but in terms of which measures’ outliers were most negatively correlated with classifier performance.

dropped %>% 
  str_remove("_Outlier") %>% 
  set_names(., .) %>% 
  map_int(~ smryTuned$varImp %>% sort(decreasing=TRUE) %>% names() %>% equals(.x) %>% which())
intens_F3min        F2_50        F4_55        F3_75        F2_20        F1_80 
          14           40           64           61          137           58 

Dropping these outliers comes at the price of about 1/6 of all auto-coded and hand-coded tokens.

tibble(Dropped = "None", 
       Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
       Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>% 
  rbind(
    cbind(Dropped = dropped,
          map_dfr(dropped %>% seq_along(),
                  ~ trainingData %>%
                    filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
                    count(HowCoded) %>%
                    spread(HowCoded, n)
          )
    )
  ) %>%
  mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))

The loss of these tokens pays off in an incremental increase in the performance of our classifier.

list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1, 
     smryDropOut4_1, smryDropOut5_1, smryDropOut6_1) %>%
  set_names(c("None", dropped) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             separate(Resample, c("Fold","Rep"), "\\.") %>% 
             group_by(Model, Rep) %>% 
             summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped) %>% str_remove("_Outlier")) %>% 
  ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
  geom_density(alpha=0.5)

list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1, 
     smryDropOut4_1, smryDropOut5_1, smryDropOut6_1) %>%
  set_names(c("None", dropped) %>% str_remove("_Outlier")) %>%
  imap_dfr(~ .x %>% 
             extract2("resample") %>% 
             mutate(Dropped = .y)) %>% 
  mutate_at(vars(Dropped), factor, levels=c("None", dropped) %>% str_remove("_Outlier")) %>% 
  select(-Model) %>% 
  group_by(Dropped) %>% 
  summarise_if(is.numeric, mean, na.rm=TRUE)

Step 6: Train final classifier

Next, we’ll train a final classifier with the settings we’ve tuned.

finalStartTime <- proc.time()
registerDoSEQ()
set.seed(302302)
finalClassifier <- train(Rpresent ~ .,
                         data=trainingData %>% 
                           filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier, 
                                          F1_70_Outlier, F1_50_Outlier, F3min_Outlier), 
                                     all_vars(!.)) %>% 
                           filter(HowCoded=="Hand") %>% 
                           select(Rpresent, TokenDur:absSlopeF0),
                         method="ranger",
                         importance="impurity",
                         num.trees=750,
                         trControl=trainControl(method="repeatedcv", 
                                                number=12,
                                                repeats=3,
                                                sampling="smote",
                                                savePredictions=TRUE,
                                                summaryFunction=cutoffSummary,
                                                returnResamp="all",
                                                classProbs=TRUE),
                         tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
finalEndTime <- proc.time()

Here are the properties of our final classifier: performance measures, confusion matrix, variable importance, and timing. We’ll save this final classifier to file.

finalClassifier %>% fstSummary()
$resample
      AccAUC  Accuracy       AUC BestCutoff AccuracyHalf ClassAccuracy_Absent
1  0.7492338 0.8439898 0.8877285  0.4966434    0.8388747            0.8794326
2  0.7699420 0.8615385 0.8936827  0.5822598    0.8435897            0.9539007
3  0.7151271 0.8414322 0.8498926  0.5689011    0.8184143            0.9397163
4  0.7360015 0.8363171 0.8800508  0.5825537    0.8260870            0.9361702
5  0.7770717 0.8618926 0.9015876  0.5591434    0.8439898            0.9609929
6  0.7040322 0.8363171 0.8418245  0.6414161    0.8081841            0.9397163
7  0.7103110 0.8307692 0.8550039  0.5158831    0.8179487            0.9113475
8  0.7654916 0.8567775 0.8934544  0.5500653    0.8414322            0.9326241
9  0.7423077 0.8307692 0.8935185  0.4710042    0.8256410            0.8439716
10 0.7386690 0.8414322 0.8778710  0.6127693    0.8235294            0.9468085
11 0.7570093 0.8542199 0.8861995  0.5068587    0.8516624            0.9184397
12 0.7162556 0.8388747 0.8538291  0.5211365    0.8158568            0.9078014
13 0.7234014 0.8282051 0.8734568  0.6382508    0.8256410            0.9609929
14 0.7494377 0.8491049 0.8826209  0.5053540    0.8465473            0.9113475
15 0.7157915 0.8260870 0.8664845  0.6495841    0.8107417            0.9574468
16 0.7000082 0.8209719 0.8526579  0.5644640    0.8132992            0.9468085
17 0.7738050 0.8695652 0.8898757  0.5893984    0.8439898            0.9361702
18 0.7948897 0.8746803 0.9087774  0.5369598    0.8670077            0.9255319
19 0.7443902 0.8414322 0.8846704  0.5995884    0.8081841            0.9539007
20 0.7406902 0.8461538 0.8753612  0.5259228    0.8358974            0.9290780
21 0.7640892 0.8618926 0.8865248  0.5648476    0.8491049            0.9432624
22 0.7430284 0.8439898 0.8803761  0.6097016    0.8363171            0.9503546
23 0.7281709 0.8286445 0.8787494  0.6216042    0.8107417            0.9255319
24 0.7330484 0.8512821 0.8611111  0.6131339    0.8307692            0.9503546
25 0.7405847 0.8439898 0.8774806  0.5499471    0.8363171            0.9148936
26 0.7305783 0.8307692 0.8793998  0.5567836    0.8179487            0.9042553
27 0.7257802 0.8388747 0.8651832  0.5242381    0.8286445            0.9184397
28 0.8037395 0.8746803 0.9188952  0.4453212    0.8670077            0.9007092
29 0.7257177 0.8286445 0.8757889  0.5676569    0.8260870            0.9219858
30 0.7796243 0.8670077 0.8992127  0.5159556    0.8593350            0.9255319
31 0.7001048 0.8312020 0.8422799  0.6177553    0.8209719            0.9539007
32 0.7594345 0.8618926 0.8811243  0.6095299    0.8363171            0.9539007
33 0.7461148 0.8410256 0.8871487  0.6698905    0.8205128            0.9716312
34 0.6881501 0.8179487 0.8413121  0.6578153    0.7948718            0.9539007
35 0.7455440 0.8516624 0.8753985  0.5263952    0.8388747            0.9255319
36 0.7487396 0.8439898 0.8871430  0.5169365    0.8465473            0.8829787
   ClassAccuracy_Present ClassAccuracyHalf_Absent ClassAccuracyHalf_Present
1              0.7522936                0.8794326                 0.7339450
2              0.6203704                0.9078014                 0.6759259
3              0.5871560                0.8758865                 0.6697248
4              0.5779817                0.8900709                 0.6605505
5              0.6055046                0.9148936                 0.6605505
6              0.5688073                0.8546099                 0.6880734
7              0.6203704                0.8936170                 0.6203704
8              0.6605505                0.8900709                 0.7155963
9              0.7962963                0.8581560                 0.7407407
10             0.5688073                0.8758865                 0.6880734
11             0.6880734                0.9148936                 0.6880734
12             0.6605505                0.8723404                 0.6697248
13             0.4814815                0.8900709                 0.6574074
14             0.6880734                0.9042553                 0.6972477
15             0.4862385                0.8758865                 0.6422018
16             0.4954128                0.8971631                 0.5963303
17             0.6972477                0.8794326                 0.7522936
18             0.7431193                0.8971631                 0.7889908
19             0.5504587                0.8617021                 0.6697248
20             0.6296296                0.8971631                 0.6759259
21             0.6513761                0.9042553                 0.7064220
22             0.5688073                0.8900709                 0.6972477
23             0.5779817                0.8687943                 0.6605505
24             0.5925926                0.8829787                 0.6944444
25             0.6605505                0.8829787                 0.7155963
26             0.6388889                0.8617021                 0.7037037
27             0.6330275                0.8936170                 0.6605505
28             0.8073394                0.9184397                 0.7339450
29             0.5871560                0.8900709                 0.6605505
30             0.7155963                0.9113475                 0.7247706
31             0.5137615                0.8936170                 0.6330275
32             0.6238532                0.8865248                 0.7064220
33             0.5000000                0.8794326                 0.6666667
34             0.4629630                0.8723404                 0.5925926
35             0.6605505                0.9042553                 0.6697248
36             0.7431193                0.8758865                 0.7706422
   mtry splitrule min.node.size    Resample
1    13      gini            10 Fold01.Rep1
2    13      gini            10 Fold02.Rep1
3    13      gini            10 Fold03.Rep1
4    13      gini            10 Fold04.Rep1
5    13      gini            10 Fold05.Rep1
6    13      gini            10 Fold06.Rep1
7    13      gini            10 Fold07.Rep1
8    13      gini            10 Fold08.Rep1
9    13      gini            10 Fold09.Rep1
10   13      gini            10 Fold10.Rep1
11   13      gini            10 Fold11.Rep1
12   13      gini            10 Fold12.Rep1
13   13      gini            10 Fold01.Rep2
14   13      gini            10 Fold02.Rep2
15   13      gini            10 Fold03.Rep2
16   13      gini            10 Fold04.Rep2
17   13      gini            10 Fold05.Rep2
18   13      gini            10 Fold06.Rep2
19   13      gini            10 Fold07.Rep2
20   13      gini            10 Fold08.Rep2
21   13      gini            10 Fold09.Rep2
22   13      gini            10 Fold10.Rep2
23   13      gini            10 Fold11.Rep2
24   13      gini            10 Fold12.Rep2
25   13      gini            10 Fold01.Rep3
26   13      gini            10 Fold02.Rep3
27   13      gini            10 Fold03.Rep3
28   13      gini            10 Fold04.Rep3
29   13      gini            10 Fold05.Rep3
30   13      gini            10 Fold06.Rep3
31   13      gini            10 Fold07.Rep3
32   13      gini            10 Fold08.Rep3
33   13      gini            10 Fold09.Rep3
34   13      gini            10 Fold10.Rep3
35   13      gini            10 Fold11.Rep3
36   13      gini            10 Fold12.Rep3

$confMat
         Actual
Predicted Absent Present
  Absent    9410    1565
  Present    742    2350

$varImp
    TokenDur        F1_20  diffF3F1_20       BW1_20        F1_25  diffF3F1_25 
   30.764500    14.989296    27.774775    12.594526    14.898406    35.396210 
      BW1_25        F1_30  diffF3F1_30       BW1_30        F1_35  diffF3F1_35 
   11.363728    14.465964    56.675720    11.031246    15.037537    53.184867 
      BW1_35        F1_40  diffF3F1_40       BW1_40        F1_45  diffF3F1_45 
   11.644876    15.900869    88.474917    11.188914    20.208349    82.429002 
      BW1_45        F1_50  diffF3F1_50       BW1_50        F1_55  diffF3F1_55 
   11.109157    21.000783    98.752872    11.300654    22.814180   105.090890 
      BW1_55        F1_60  diffF3F1_60       BW1_60        F1_65  diffF3F1_65 
   10.484514    20.028613   101.340742    10.741350    21.784939   107.645116 
      BW1_65        F1_70  diffF3F1_70       BW1_70        F1_75  diffF3F1_75 
   11.684349    23.100147    85.321544    11.364907    17.120429    69.676064 
      BW1_75        F1_80  diffF3F1_80       BW1_80        F1min        F1max 
   13.908952    16.264453    71.695080    13.982099    13.978461    16.244214 
  time_F1min   time_F1max      F1range time_F1range      slopeF1        F2_20 
   10.716749    15.976890    15.492185    14.778713    15.017035    13.418690 
 diffF3F2_20       BW2_20        F2_25  diffF3F2_25       BW2_25        F2_30 
   18.594373    15.508765    14.383006    18.672362    16.197715    14.697420 
 diffF3F2_30       BW2_30        F2_35  diffF3F2_35       BW2_35        F2_40 
   22.556039    14.553842    18.917379    31.091438    14.986958    18.764349 
 diffF3F2_40       BW2_40        F2_45  diffF3F2_45       BW2_45        F2_50 
   57.603532    14.368941    21.673919    41.670748    14.341450    25.075409 
 diffF3F2_50       BW2_50        F2_55  diffF3F2_55       BW2_55        F2_60 
   33.157534    15.661945    26.726035    58.844238    15.921802    27.298895 
 diffF3F2_60       BW2_60        F2_65  diffF3F2_65       BW2_65        F2_70 
   70.402097    15.146769    27.666039    58.206411    15.455432    30.522271 
 diffF3F2_70       BW2_70        F2_75  diffF3F2_75       BW2_75        F2_80 
   67.469697    13.632785    29.495273    69.273594    14.613164    28.612871 
 diffF3F2_80       BW2_80        F2min        F2max   time_F2min   time_F2max 
   59.781687    19.667721    29.827593    18.538668    14.969185    14.247801 
     F2range time_F2range      slopeF2        F3_20       BW3_20        F3_25 
   15.528431    13.697598    13.794399    16.299933    12.529338    14.299296 
      BW3_25        F3_30       BW3_30        F3_35       BW3_35        F3_40 
   12.260252    16.969548    11.644058    15.191504    12.540938    19.668322 
      BW3_40        F3_45       BW3_45        F3_50       BW3_50        F3_55 
   11.902571    23.031201    11.924540    23.296989    12.930897    20.841978 
      BW3_55        F3_60       BW3_60        F3_65       BW3_65        F3_70 
   14.252782    23.552730    13.918116    22.068816    15.685204    27.661179 
      BW3_70        F3_75       BW3_75        F3_80       BW3_80        F3min 
   14.206712    24.406039    15.252917    22.914951    14.832923    46.059445 
       F3max   time_F3min   time_F3max      F3range time_F3range      slopeF3 
   16.779745    10.472060    10.696035    22.102011    12.300931    21.105624 
intens_F3min intens_F3max        F4_20  diffF4F3_20       BW4_20        F4_25 
   62.546789    57.160306    14.077990    17.383573    14.463928    13.705212 
 diffF4F3_25       BW4_25        F4_30  diffF4F3_30       BW4_30        F4_35 
   15.106839    12.998409    14.097109    14.712706    13.672993    16.610082 
 diffF4F3_35       BW4_35        F4_40  diffF4F3_40       BW4_40        F4_45 
   15.627823    13.311154    18.875092    16.406844    12.283367    18.648821 
 diffF4F3_45       BW4_45        F4_50  diffF4F3_50       BW4_50        F4_55 
   16.372688    12.600801    18.719656    17.039564    13.895799    16.920489 
 diffF4F3_55       BW4_55        F4_60  diffF4F3_60       BW4_60        F4_65 
   16.854955    12.618690    16.277981    15.287159    12.141336    15.556432 
 diffF4F3_65       BW4_65        F4_70  diffF4F3_70       BW4_70        F4_75 
   14.857038    11.304223    15.001727    12.751703    11.673830    14.855942 
 diffF4F3_75       BW4_75        F4_80  diffF4F3_80       BW4_80        F4min 
   13.773128    11.720099    16.228348    14.589646    14.368805    22.182335 
       F4max   time_F4min   time_F4max      F4range time_F4range      slopeF4 
   14.237622    14.364799    12.857766    17.628338    12.674008    18.742736 
       F0min        F0max    F0rangeST   time_F0min   time_F0max   absSlopeF0 
   43.821598    40.135800    19.322108     9.391833    10.985579    12.539335 

$times
$times$everything
    user   system  elapsed 
1725.510   23.025  401.573 

$times$final
   user  system elapsed 
 52.869   0.862  12.695 
finalClassifier %>% saveRDS("Data/ClassifierR.Rds")

Step 7: Auto-code new tokens and extract classifier probabilities for hand-coded tokens

Auto-coding is easy and fast. To get continuous probabilities, we pass the final classifier and non-hand-coded data (without the dependent variable and with relevant measures only) to predict() with type="prob".

predict() is what’s called a “generic function” in R, with lots of different methods for objects of different classes that are passed to it. Since our final classifier is an object of class train (see class(finalClassifier)), calling predict(finalClassifier) passes the finalClassifier object to the predict.train() function that’s loaded with the caret package. To see all the methods for a generic function like predict(), run methods("predict").

preds <-
  predict(finalClassifier,
          trainingData %>% 
            filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier, 
                           F1_70_Outlier, F1_50_Outlier, F3min_Outlier), 
                      all_vars(!.)) %>% 
            filter(HowCoded=="Auto") %>% 
            select(TokenDur:absSlopeF0),
          type="prob")
preds %>% head()

Then, as in the custom prediction function we defined above, you can use if_else() with a custom cutoff to convert continuous probabilities to binary.

bestCutoff <- mean(finalClassifier$resample$BestCutoff)
preds <- preds %>% 
  mutate(Rpresent = if_else(Present >= bestCutoff, "Present", "Absent") %>% factor())
preds %>% head()

In the case of our trainingData dataframe, where hand-coded and auto-coded tokens are interspersed, an extra step is required to add the predictions back into the dataframe. Here, we use the same trick as with the imputation dataframe in Step 1, creating a temporary dataframe for prediction that includes a unique identifier column (which we sneak out of the dataframe before prediction and sneak back in after prediction), joining classifier probabilities with the original trainingData dataframe, and recoding the Rpresent dependent variable only for the auto-coded tokens with non-NA classifier probabilities.

preds <-
  trainingData %>% 
  filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier, 
                 F1_70_Outlier, F1_50_Outlier, F3min_Outlier), 
            all_vars(!.)) %>% 
  filter(HowCoded=="Auto") %>% 
  select(MatchId, TokenDur:absSlopeF0)

preds <-
  predict(finalClassifier, preds %>% select(-MatchId), type="prob") %>% 
  cbind(preds %>% select(MatchId))

trainingData <- 
  trainingData %>% 
  left_join(preds %>% select(MatchId, ProbPresent = Present), by="MatchId") %>% 
  mutate(Rpresent = case_when(
    HowCoded=="Hand" ~ Rpresent %>% as.character(),
    HowCoded=="Auto" & !is.na(ProbPresent) & ProbPresent >= bestCutoff ~ "Present",
    HowCoded=="Auto" & !is.na(ProbPresent) & ProbPresent < bestCutoff ~ "Absent",
    TRUE ~ NA_character_
  ) %>% 
    factor())

rm(preds)

And now we’ve got auto-coded tokens! Here’s a sampling:

set.seed(302302)
trainingData %>% 
  filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier, 
                 F1_70_Outlier, F1_50_Outlier, F3min_Outlier), 
            all_vars(!.)) %>% 
  filter(HowCoded=="Auto") %>% 
  select(HowCoded, Rpresent, ProbPresent, TokenDur:absSlopeF0) %>% 
  sample_n(10)

Finally, we can extract resampled predictions (binary codes and classifier probabilities) for the hand-coded training set, as long as we’ve set savePredictions=TRUE inside trainControl(). Note that we do not recommend comparing classifier probabilities for hand-coded tokens to those of the auto-coded tokens because they are generated with slightly different processes (the auto-coded tokens are exposed to the entire classifier, the hand-coded tokens to a subset); nonetheless, obtaining the hand-coded probabilities can be useful for some applications (e.g., we used them to construct homogeneous vs. heterogeneous Present classes in the sample size simulations). The dataframe that results from the following code can be cbind()-ed to our existing hand-coded data. (Note also that this code works for the repeated k-fold cross-validation method we used, but the code would need to be tweaked for different methods of generating training/test subsets.)

finalClassifier$pred %>% 
  arrange(rowIndex) %>% 
  group_by(rowIndex) %>% 
  summarise(ProbPresent = mean(Present, na.rm=TRUE))

Session info

This guide was generated in an R session with the following information:

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS:   /usr/lib64/libblas.so.3.4.2
LAPACK: /usr/lib64/liblapack.so.3.4.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] ROCR_1.0-11       diptest_0.76-0    doParallel_1.0.16 iterators_1.0.13 
 [5] foreach_1.5.1     DMwR_0.4.1        knitr_1.33        caret_6.0-88     
 [9] lattice_0.20-44   ranger_0.13.1     magrittr_2.0.1    forcats_0.5.1    
[13] stringr_1.4.0     dplyr_1.0.7       purrr_0.3.4       readr_2.0.2      
[17] tidyr_1.1.4       tibble_3.1.4      ggplot2_3.3.5     tidyverse_1.3.1  

loaded via a namespace (and not attached):
 [1] nlme_3.1-152         fs_1.5.0             xts_0.12.1          
 [4] lubridate_1.7.10     httr_1.4.2           tools_4.1.0         
 [7] backports_1.2.1      bslib_0.3.0          utf8_1.2.2          
[10] R6_2.5.1             rpart_4.1-15         DBI_1.1.1           
[13] colorspace_2.0-2     nnet_7.3-16          withr_2.4.2         
[16] tidyselect_1.1.1     curl_4.3.2           compiler_4.1.0      
[19] cli_3.0.1            rvest_1.0.0          xml2_1.3.2          
[22] labeling_0.4.2       bookdown_0.22        sass_0.4.0          
[25] scales_1.1.1         proxy_0.4-26         digest_0.6.28       
[28] rmarkdown_2.9        pkgconfig_2.0.3      htmltools_0.5.2     
[31] highr_0.9            dbplyr_2.1.1         fastmap_1.1.0       
[34] TTR_0.24.2           rlang_0.4.11         readxl_1.3.1        
[37] quantmod_0.4.18      rstudioapi_0.13      farver_2.1.0        
[40] jquerylib_0.1.4      generics_0.1.0       zoo_1.8-9           
[43] jsonlite_1.7.2       ModelMetrics_1.2.2.2 Matrix_1.3-4        
[46] Rcpp_1.0.7           munsell_0.5.0        fansi_0.5.0         
[49] abind_1.4-5          lifecycle_1.0.1      stringi_1.7.4       
[52] pROC_1.17.0.1        yaml_2.2.1           MASS_7.3-54         
[55] plyr_1.8.6           recipes_0.1.16       crayon_1.4.1        
[58] haven_2.4.1          splines_4.1.0        hms_1.1.1           
[61] pillar_1.6.3         reshape2_1.4.4       codetools_0.2-18    
[64] stats4_4.1.0         reprex_2.0.0         glue_1.4.2          
[67] evaluate_0.14        data.table_1.14.2    modelr_0.1.8        
[70] vctrs_0.3.8          tzdb_0.1.2           cellranger_1.1.0    
[73] gtable_0.3.0         assertthat_0.2.1     xfun_0.24           
[76] gower_0.2.2          prodlim_2019.11.13   broom_0.7.8         
[79] e1071_1.7-7          class_7.3-19         survival_3.2-11     
[82] timeDate_3043.102    lava_1.6.9           ellipsis_0.3.2      
[85] ipred_0.9-11        

Generate test-script summary

##Pipe info from sacct into temp-file
sacctString <- paste0("sacct -j ", params$`SLURM_job-id`, " -o JobID,Elapsed,Partition,Reserved,ResvCPU,AllocCPUS,MaxRSS -P > tmp_", params$`SLURM_job-id`)
system(sacctString)
##Read info
info <- read_delim(paste0("tmp_", params$`SLURM_job-id`), delim="|", col_types="ctcttic")
resTime <- 
  info %>% 
  filter(!is.na(Partition),
         JobID==params$`SLURM_job-id`) %>% 
  pull(Reserved) %>% 
  replace_na(0) %>% 
  as.double(units="secs")
##Generate missing code if there's no resTime
if (length(resTime)==0) resTime <- "Reservation time missing"
##Remove temp-file--comment out for testing
file.remove(paste0("tmp_", params$`SLURM_job-id`))
[1] TRUE
##Outfile name
outfile <- paste0(if_else(params$testing, "Timing_TEST", "Timing"), ".Rds")

stopTime <- proc.time()

timingSmry <- list(
  jobID = params$`SLURM_job-id`,
  nodelist = params$SLURM_nodeid,
  testing = params$testing,
  partition = params$SBATCH_partition,
  nodes = params$SBATCH_nodes,
  tasks = params$`SBATCH_tasks-per-node`,
  cpus = params$`SBATCH_cpus-per-task`,
  mem = params$SBATCH_mem,
  reservTime = resTime,
  startTime = startSysTime,
  setupTime = setupEndTime - startTime,
  saveLoadTime = saveLoadEndTime - saveLoadStartTime,
  fstInitTime = fstInitEndTime - fstInitStartTime,
  resampTime = resampEndTime - resampStartTime,
  methodsTime = methodsEndTime - methodsStartTime,
  foldsRepsTime = foldsRepsEndTime - foldsRepsStartTime,
  tuneListTime = tuneListEndTime - tuneListStartTime,
  tunedTime = tunedEndTime - tunedStartTime,
  noOutTime = noOutEndTime - noOutStartTime,
  drop0Time = drop0EndTime - drop0StartTime,
  drop11Time = drop11EndTime - drop11StartTime,
  drop21Time = drop21EndTime - drop21StartTime,
  drop22Time = drop22EndTime - drop22StartTime,
  drop31Time = drop31EndTime - drop31StartTime,
  drop41Time = drop41EndTime - drop41StartTime,
  drop51Time = drop51EndTime - drop51StartTime,
  drop61Time = drop61EndTime - drop61StartTime,
  drop71Time = drop71EndTime - drop71StartTime,
  drop72Time = drop72EndTime - drop72StartTime,
  drop73Time = drop73EndTime - drop73StartTime,
  totalTime = stopTime - startTime
)

timingSmry %>% saveRDS(outfile)

timingSmry
$jobID
[1] 4481119

$nodelist
[1] "smp-n28"

$testing
[1] FALSE

$partition
[1] "smp"

$nodes
[1] 1

$tasks
[1] 4

$cpus
[1] 4

$mem
[1] 60000

$reservTime
[1] 107

$startTime
[1] "2021-10-06 11:00:13 EDT"

$setupTime
   user  system elapsed 
  2.363   0.302  20.958 

$saveLoadTime
   user  system elapsed 
  5.378   0.070   5.559 

$fstInitTime
    user   system  elapsed 
8970.641   13.795  894.351 

$resampTime
   user  system elapsed 
  3.561   0.500 401.408 

$methodsTime
   user  system elapsed 
  7.175   1.430 807.927 

$foldsRepsTime
    user   system  elapsed 
  22.678    4.988 3569.458 

$tuneListTime
    user   system  elapsed 
  37.177    9.464 3505.558 

$tunedTime
    user   system  elapsed 
2319.748   27.297  520.749 

$noOutTime
   user  system elapsed 
798.062   9.943 168.803 

$drop0Time
    user   system  elapsed 
   4.278    1.642 1109.866 

$drop11Time
    user   system  elapsed 
   4.166    1.651 1115.601 

$drop21Time
    user   system  elapsed 
   4.172    1.573 1022.830 

$drop22Time
    user   system  elapsed 
   4.165    1.604 1032.348 

$drop31Time
    user   system  elapsed 
   4.622    2.081 1002.444 

$drop41Time
   user  system elapsed 
  4.141   1.646 922.612 

$drop51Time
   user  system elapsed 
  4.110   1.811 910.248 

$drop61Time
   user  system elapsed 
  4.187   1.797 842.017 

$drop71Time
   user  system elapsed 
  4.107   1.707 825.402 

$drop72Time
   user  system elapsed 
  4.090   1.761 779.808 

$drop73Time
   user  system elapsed 
  4.092   1.699 809.448 

$totalTime
     user    system   elapsed 
14592.494   247.922 22315.589 

References

Breen, M. S., Thomas, K. G., Baldwin, D. S., & Lipinska, G. (2019). Modelling PTSD diagnosis using sleep, memory, and adrenergic metabolites: An exploratory machine-learning study. Human Psychopharmacology: Clinical and Experimental, 34(2), e2691. https://doi.org/10.1002/hup.2691

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953

Freeman, J. B., & Dale, R. (2013). Assessing bimodality to detect the presence of a dual cognitive process. Behavior Research Methods (Online), 45(1), 83–97. https://doi.org/10.3758/s13428-012-0225-x

Fromont, R., & Hay, J. (2012). LaBB-CAT: An annotation store. Proceedings of Australasian Language Technology Association Workshop, 113–117. Retrieved from http://www.aclweb.org/anthology/U12-1015

Hartigan, J. A., & Hartigan, P. M. (1985). The Dip Test of Unimodality. Annals of Statistics, 13(1), 70–84. https://doi.org/10.1214/aos/1176346577

Hay, J., & Clendon, A. (2012). (Non)rhoticity: Lessons from New Zealand English. In T. Nevalainen & E. C. Traugott (Eds.), The Oxford Handbook of the history of English (pp. 761–772). Oxford: Oxford University Press.

Heselwood, B. (2009). Rhoticity without F3: Lowpass filtering and the perception of rhoticity in ’NORTH/FORCE,’ ’START,’ and ’NURSE’ words. Leeds Working Papers in Linguistics and Phonetics, 14, 49–64. https://doi.org/10.1.1.500.6321

Hudak, A. T., Crookston, N. L., Evans, J. S., Hall, D. E., & Falkowski, M. J. (2008). Nearest neighbor imputation of species-level, plot-scale forest structure attributes from LiDAR data. Remote Sensing of Environment, 112(5), 2232–2245. https://doi.org/https://doi.org/10.1016/j.rse.2007.10.009

Kuhn, M. (2018). Caret. Retrieved from https://CRAN.R-project.org/package=caret

Lawson, E., Stuart-Smith, J., & Scobbie, J. (2018). The role of gesture delay in coda /r/ weakening: An articulatory, auditory and acoustic study. Journal of the Acoustical Society of America, 143(3), 1646–1657. https://doi.org/10.1121/1.5027833

Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22. Retrieved from https://CRAN.R-project.org/doc/Rnews/

Maechler, M. (2016). Diptest. Retrieved from https://cran.r-project.org/package=diptest

R Core Team. (2018). R: A language and environment for statistical computing. Retrieved from https://www.R-project.org/

Torgo, L. (2010). Data Mining with R, learning with case studies. Retrieved from http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR

Wright, M. N., & Ziegler, A. (2017). Ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. https://doi.org/10.18637/jss.v077.i01